Saturday, March 14, 2009

"smart" quotes in PHP

--------------------------------------------------------------------------------

Hello all,

I've been struggling for a few days with the question of how to convert
"smart" (curly) quotes into straight quotes. I tried playing with the
htmlentities() function, but all that is doing is changing the smart
quotes into nonsense characters. I also searched the web for quite a
while and was unsuccessful in finding a solution.

What puzzles me is that doing it the other way around is simple enough.
For example, this works fine in converting a straight quote into an
"open" smart quote:

if ($content[$k] == "\"")
    $content = substr($content, 0, $k) . "&#147;" . substr($content, $k+1, strlen($content)-$k+1);

But the other way around doesn't work. Any ideas?

Thanks,

Martin Goldman
My e-mail address's correct domain name is mgoldman.com.


#2 July 17th, 2005, 01:08 AM
Daniel Tryba
Guest Posts: n/a

Re: "smart" quotes in PHP

--------------------------------------------------------------------------------

Martin Goldman wrote:[color=blue]
> I've been struggling for a few days with the question of how to convert
> "smart" (curly) quotes into straight quotes.[/color]

Smart/curly quotes? straight quotes? What are these?
[color=blue]
> What puzzles me is that doing it the other way around is simple enough.
> For example, this works fine in converting a straight quote into an
> "open" smart quote:
>
> if ($content[$k] == "\"")
> $content = substr($content, 0, $k) . "&#147;" . substr
> ($content, $k+1, strlen($content)-$k+1);[/color]

Funny way to do a str_replace :)

What character is represented by #147? AFAIK it's not in any character
set I know (ASCII or ISO-8859-x). So your actual problem might be that
you are using a different encoding for the character you want to replace
than PHP is actually using!

BTW, the 3rd parameter of htmlentities() specifies the character set.

--

Daniel Tryba



#3 July 17th, 2005, 01:08 AM
Andy Hassall
Guest Posts: n/a

Re: "smart" quotes in PHP

--------------------------------------------------------------------------------

On Fri, 14 Nov 2003 17:42:08 GMT, Martin Goldman wrote:
[color=blue]
>I've been struggling for a few days with the question of how to convert
>"smart" (curly) quotes into straight quotes. I tried playing with the
>htmlentities() function, but all that is doing is changing the smart
>quotes into nonsense characters. I also searched the web for quite a
>while and was unsuccessful in finding a solution.[/color]

You've got to work out what character set the text is encoded in, for
starters, since 'smart quotes' exist in Microsoft's codepage 1252 but not in
the standard ISO 8859 character sets, e.g. iso-8859-15.

In codepage 1252:

hex  dec  Unicode  Unicode name
91   145  8216     LEFT SINGLE QUOTATION MARK
92   146  8217     RIGHT SINGLE QUOTATION MARK
93   147  8220     LEFT DOUBLE QUOTATION MARK
94   148  8221     RIGHT DOUBLE QUOTATION MARK

But in iso-8859-15, 145-148 aren't defined as printable characters; 128-159
are reserved for control characters.

So if you change it to &#147;, but output your page encoded in iso-8859-1,
you're just changing it to the code for a non-printable character. The same
entity will appear as a left double quotation mark if encoded in Windows-1252
though.
[color=blue]
>What puzzles me is that doing it the other way around is simple enough.
>For example, this works fine in converting a straight quote into an
>"open" smart quote:
>
> if ($content[$k] == "\"")
> $content = substr($content, 0, $k) . "&#147;" . substr
>($content, $k+1, strlen($content)-$k+1);
>
>But the other way around doesn't work. Any ideas?[/color]

In what way doesn't it work? What does str_replace(chr(147), '"', $content);
appear to do in your setup?
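
For what it's worth, here is a minimal sketch of what that call does, assuming the arguments are given in PHP's documented order (search, replace, subject). The sample strings and variable names are made up for illustration; the point is only that str_replace() compares raw bytes, so the result depends on how the text is encoded.

```php
<?php
// Hypothetical samples: the same text in two encodings.
$cp1252 = "\x93Hello\x94";                 // curly quotes as single cp1252 bytes 147 and 148
$utf8   = "\xE2\x80\x9CHello\xE2\x80\x9D"; // the same quotes as three-byte UTF-8 sequences

// str_replace() works on bytes, so chr(147)/chr(148) only match the cp1252 form.
$search = array(chr(147), chr(148));
$fixedCp1252 = str_replace($search, '"', $cp1252);
$fixedUtf8   = str_replace($search, '"', $utf8);

var_dump($fixedCp1252 === '"Hello"'); // bool(true): both bytes were replaced
var_dump($fixedUtf8 === $utf8);       // bool(true): nothing matched, string unchanged
```

If the second case is what you see, the text is probably not in cp1252 at all.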

--
Andy Hassall (andy@andyh.co.uk) icq(5747695) (http://www.andyh.co.uk)
Space: disk usage analysis tool (http://www.andyhsoftware.co.uk/space)


#4 July 17th, 2005, 01:08 AM
John Dunlop
Guest Posts: n/a

Re: "smart" quotes in PHP

--------------------------------------------------------------------------------

Martin Goldman wrote:
[color=blue]
> I've been struggling for a few days with the question of how to convert
> "smart" (curly) quotes into straight quotes.[/color]

As D. Tryba hinted at, str_replace should work fine. After all,
you're replacing one character with another.

$string = str_replace($chr,'"',$string)

where $chr is the character you want to replace.
[color=blue]
> I tried playing with the htmlentities() function, but all that is doing
> is changing the smart quotes into nonsense characters.[/color]

I'd be interested in seeing what you actually tried. Since so-called
smart quotes aren't in the Latin-1 repertoire, you'd have to specify
a charset other than the default ISO-8859-1. Say you typed smart
quotes on a bog standard Windows system by holding down Alt and
pressing 0, 1, 4, and 7 (or 8) on the numeric keypad, you'd use

$string = htmlentities($string,ENT_COMPAT,'cp1252')

where $string is the string containing smart quotes. That converts
smart quotes to their respective entity references.
[color=blue]
> What puzzles me is that doing it the other way around is simple enough.[/color]

Eek! I'd have thought that was *more* difficult...
[color=blue]
> if ($content[$k] == "\"")
> $content = substr($content, 0, $k) . "&#147;" . substr
> ($content, $k+1, strlen($content)-$k+1);[/color]

How does your script know that the quotation mark was intended as an
opening quotation mark? ;-)

In HTML, the character reference &#147; is undefined. The LEFT DOUBLE
QUOTATION MARK can be represented using the character reference
&#8220; or the entity reference &ldquo;. The RIGHT DOUBLE QUOTATION
MARK can be represented using the character reference &#8221; or the
entity reference &rdquo;.
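
As a rough illustration of the htmlentities() call above, assuming the input really is cp1252 (the sample string is invented for the example):

```php
<?php
// Hypothetical input: curly double quotes as cp1252 bytes 147 and 148.
$string = "\x93quoted\x94";

// The third argument tells htmlentities() how to interpret the input bytes.
$encoded = htmlentities($string, ENT_COMPAT, 'cp1252');

echo $encoded, "\n"; // &ldquo;quoted&rdquo;
```

If the input is in some other encoding, the same call will not produce these entities, which may be where the nonsense characters come from.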

--
Jock


#5 July 17th, 2005, 01:08 AM
Martin Goldman
Guest Posts: n/a

Re: "smart" quotes in PHP

--------------------------------------------------------------------------------

John Dunlop wrote in
news:MPG.1a1f806fb5038c649897c5@news.freeserve.net :
[color=blue]
> Martin Goldman wrote:[/color]
[color=blue]
> I'd be interested in seeing what you actually tried. Since so-called
> smart quotes aren't in the Latin-1 repertoire, you'd have to specify
> a charset other than the default ISO-8859-1. Say you typed smart
> quotes on a bog standard Windows system by holding down Alt and
> pressing 0, 1, 4, and 7 (or 8) on the numeric keypad, you'd use
>
> $string = htmlentities($string,ENT_COMPAT,'cp1252')
>
> where $string is the string containing smart quotes. That converts
> smart quotes to their respective entity references.
>[/color]
This results in the smart quotes being replaced with nonsense characters.
The thing is, though, that I'm totally unfamiliar with character sets,
the differences between them, etc. I've never had any reason to care
about them. So I'm a little confused about what you guys are talking
about when it comes to them.
[color=blue]
> How does your script know that the quotation mark was intended as an
> opening quotation mark? ;-)[/color]
Well, I didn't paste the whole thing. :) I wrote a loop that goes through
the string. It toggles a flag each time a quotation mark is found. If the
flag is set, it makes it an open quote; if it's not, it makes it a closed
quote. Hence the reason I'm not just using a str_replace for that. :)

Oh, and to answer Mr. Hassall's question -- str_replace(chr(147), "\"",
$content) doesn't do anything. The exact same string is returned.
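
Roughly, the toggle loop described above looks like the sketch below. This is a simplified reconstruction, not the actual script: the function name toggleQuotes is made up, and it emits the &ldquo; and &rdquo; entities suggested earlier in the thread rather than &#147;.

```php
<?php
// Sketch of the flag-toggling approach; toggleQuotes is a hypothetical name.
function toggleQuotes(string $content): string
{
    $result = '';
    $open = true; // the next straight quote is treated as an opening quote

    for ($k = 0, $n = strlen($content); $k < $n; $k++) {
        if ($content[$k] === '"') {
            $result .= $open ? '&ldquo;' : '&rdquo;';
            $open = !$open; // alternate between opening and closing
        } else {
            $result .= $content[$k]; // copy every other byte unchanged
        }
    }

    return $result;
}

echo toggleQuotes('He said "hi" and "bye".'), "\n";
// He said &ldquo;hi&rdquo; and &ldquo;bye&rdquo;.
```

It assumes quotes are always balanced; a stray quotation mark flips every quote after it.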

-Martin


#6 July 17th, 2005, 01:09 AM
Daniel Tryba
Guest Posts: n/a

Re: "smart" quotes in PHP

--------------------------------------------------------------------------------

Martin Goldman wrote:
[confused about charsets][color=blue]
> Oh, and to answer Mr. Hassall's question -- str_replace(chr(147), "\"",
> $content) doesn't do anything. The exact same string is returned.[/color]

That might mean that there is no chr(147) in the string although you
_see_ a character that might be represented as the character you know as
147 in cp1252! Another fine example is the euro symbol: IIRC it's 128 in
cp1252 and 164 in iso-8859-15, while in iso-8859-1 164 is a generic currency
symbol and the euro symbol is missing entirely. That's why, if you want to
display the euro symbol, you are encouraged to use the HTML entity &euro;,
which can be rendered in any font and any character set (with a fallback to
EUR).

So your job is to figure out how your quote is encoded (just step through
the string and print the ord() value of each character)...

BTW Unicode kind of solves the problem by defining every known character
in one set; the problem is that not every program supports it yet. But
Unicode also introduces another problem: the way the characters are
encoded (e.g. utf7, utf8, utf16...). I don't know if PHP supports utf16+.

--

Daniel Tryba



#7 July 17th, 2005, 01:09 AM
Martin Goldman
Guest Posts: n/a

Re: "smart" quotes in PHP

--------------------------------------------------------------------------------

Daniel Tryba wrote in news:bp5nhq$d0e$1
@news.tue.nl:
[color=blue]
> That might mean that there is no chr(147) in the string although you
> _see_ a character that might be represented as the character you know as
> 147 in cp1252! Another fine example is the euro symbol: IIRC it's 128 in
> cp1252 and 164 in iso-8859-15, while in iso-8859-1 164 is a generic currency
> symbol and the euro symbol is missing entirely. That's why, if you want to
> display the euro symbol, you are encouraged to use the HTML entity &euro;,
> which can be rendered in any font and any character set (with a fallback to
> EUR).
>
> So your job is to figure out how your quote is encoded (just step through
> the string and print the ord() value of each character)...[/color]
Interesting you should suggest this, because I just did that. And indeed,
it's not coming out as 147. It's coming out as 226, followed by 128,
followed by 156. I suppose I could do a str_replace for these 3
characters and replace it with 147. Although, then I'd have to do that
for every character I want to support. What a drag.

Thanks,
Martin


#8 July 17th, 2005, 01:09 AM
Andy Hassall
Guest Posts: n/a

Re: "smart" quotes in PHP

--------------------------------------------------------------------------------

On Sat, 15 Nov 2003 19:57:14 GMT, Martin Goldman wrote:
[color=blue]
>Daniel Tryba wrote in news:bp5nhq$d0e$1
>@news.tue.nl:
>[color=green]
>> That might mean that there is no chr(147) in the string although you
>> _see_ a character that might be represented as the character you know
>> as 147 in cp1252! Another fine example is the euro symbol: IIRC it's 128 in
>> cp1252 and 164 in iso-8859-15, while in iso-8859-1 164 is a generic currency
>> symbol and the euro symbol is missing entirely. That's why, if you want to
>> display the euro symbol, you are encouraged to use the HTML entity &euro;,
>> which can be rendered in any font and any character set (with a fallback to
>> EUR).
>>
>> So your job is to figure out how your quote is encoded (just step through
>> the string and print the ord() value of each character)...[/color]
>
>Interesting you should suggest this, because I just did that. And indeed,
>it's not coming out as 147. It's coming out as 226, followed by 128,
>followed by 156. I suppose I could do a str_replace for these 3
>characters and replace it with 147. Although, then I'd have to do that
>for every character I want to support. What a drag.[/color]

Your text is encoded in UTF-8. Going back to the characters again:

hex  dec  Unicode  Unicode name
91   145  8216     LEFT SINGLE QUOTATION MARK
92   146  8217     RIGHT SINGLE QUOTATION MARK
93   147  8220     LEFT DOUBLE QUOTATION MARK
94   148  8221     RIGHT DOUBLE QUOTATION MARK

226,128,156 in binary is:

11100010
10000000
10011100

'1110' in the first few bits of the first byte indicates it is a lead byte for
a three-byte character. The remaining two are trail bytes, as they start with
10. So separating out the data gets:

1110 0010
10 000000
10 011100

=> 0010000000011100 (binary)
= 8220 (decimal)

Which is LEFT DOUBLE QUOTATION MARK.
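
The same bit-shuffling can be sketched in a few lines of PHP. This handles only the three-byte case worked through above, and the variable names are just for illustration.

```php
<?php
// The three byte values reported earlier in the thread.
$bytes = array(226, 128, 156);

// A lead byte of the form 1110xxxx contributes its low 4 bits;
// each trail byte of the form 10xxxxxx contributes its low 6 bits.
$codePoint = (($bytes[0] & 0x0F) << 12)
           | (($bytes[1] & 0x3F) << 6)
           |  ($bytes[2] & 0x3F);

echo $codePoint, "\n";          // 8220
printf("U+%04X\n", $codePoint); // U+201C
```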

--
Andy Hassall (andy@andyh.co.uk) icq(5747695) (http://www.andyh.co.uk)
Space: disk usage analysis tool (http://www.andyhsoftware.co.uk/space)


#9 July 17th, 2005, 01:12 AM
Daniel Tryba
Guest Posts: n/a

Re: "smart" quotes in PHP

--------------------------------------------------------------------------------

Andy Hassall wrote:[color=blue][color=green][color=darkred]
>>> So your job is to figure out how your quote is encoded (just step through
>>> the string and print the ord() value of each character)...[/color]
>>
>>Interesting you should suggest this, because I just did that. And indeed,
>>it's not coming out as 147. It's coming out as 226, followed by 128,
>>followed by 156. I suppose I could do a str_replace for these 3
>>characters and replace it with 147. Although, then I'd have to do that
>>for every character I want to support. What a drag.[/color]
>
> Your text is encoded in UTF-8. Going back to the characters again:[/color]
[in depth UTF-8 decoding :)]

So Martin, you should take a look at iconv or, if your server lacks
support for it, utf8_decode(). The manual page for the latter also has a
user-contributed note on how to use str_replace on UTF-8 encoded strings.
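
One possible way to do the replacement directly on the UTF-8 bytes, without converting the whole string first, is sketched below. The $map array and the sample text are illustrative assumptions, not taken from the manual note mentioned above, and only the four quote characters from the table earlier in the thread are covered.

```php
<?php
// UTF-8 byte sequences for the four curly quotes, mapped to ASCII quotes.
$map = array(
    "\xE2\x80\x98" => "'", // U+2018 LEFT SINGLE QUOTATION MARK
    "\xE2\x80\x99" => "'", // U+2019 RIGHT SINGLE QUOTATION MARK
    "\xE2\x80\x9C" => '"', // U+201C LEFT DOUBLE QUOTATION MARK
    "\xE2\x80\x9D" => '"', // U+201D RIGHT DOUBLE QUOTATION MARK
);

// Hypothetical sample text encoded in UTF-8.
$content = "\xE2\x80\x9CIt\xE2\x80\x99s fine\xE2\x80\x9D";

// str_replace() accepts arrays, so all four sequences are handled in one call.
$straight = str_replace(array_keys($map), array_values($map), $content);

echo $straight, "\n"; // "It's fine"
```

Converting the whole string to cp1252 with iconv first and then replacing chr(145) through chr(148) should be another route, but that depends on iconv being available on the server.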

--

Daniel Tryba



#10 July 17th, 2005, 01:16 AM
Martin Goldman
Guest Posts: n/a

Re: "smart" quotes in PHP

--------------------------------------------------------------------------------

Daniel Tryba wrote in
news:bpee7i$5fr$2@news.tue.nl:
[color=blue]
> Andy Hassall wrote:[/color]
[color=blue]
> So Martin, you should take a look at iconv or, if your server lacks
> support for it, utf8_decode(). The manual page for the latter also has a
> user-contributed note on how to use str_replace on UTF-8 encoded strings.
>[/color]

Great. Thanks to everyone who replied.

-Martin
my correct domain name is mgoldman.com
