Tuesday, December 27, 2011

[Java] Regex replace matcher

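The project needs to find certain characters and replace them.
Because each match has to be decoded before it is substituted, the replaceAll method is not suitable.
The approach below handles this case: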

Reference:http://stackoverflow.com/questions/5568081/regex-replace-all-ignore-case

Avoid ruining the original capitalization:

In the above approach however, you're ruining the capitalization of the replaced word. Here is a better suggestion:
String inText="Sony Ericsson is a leading company in mobile. " +
              "The company sony ericsson was found in oct 2001";
String word = "sony ericsson";
Pattern p = Pattern.compile(word, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(inText);
StringBuffer sb = new StringBuffer();
while (m.find()) {
  String replacement = m.group().replace(' ', '~');
  m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
}
m.appendTail(sb);
String outText = sb.toString();
System.out.println(outText);
Output:
Sony~Ericsson is a leading company in mobile. The company sony~ericsson was found in oct 2001
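
For the requirement described at the top of this post (each match must be decoded before it is substituted), the same appendReplacement loop can call a decoder on every match. The following is a minimal sketch, assuming the matches are runs of percent-encoded bytes and that URLDecoder with UTF-8 is the decoder needed; the class name MatchDecoder and the method decodeMatches are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchDecoder {
    // Assumption: the text to decode consists of runs of %XX escapes.
    private static final Pattern ENCODED = Pattern.compile("(?:%[0-9A-Fa-f]{2})+");

    // Decodes every percent-encoded run in the input and leaves the rest untouched.
    static String decodeMatches(String input) {
        Matcher m = ENCODED.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String decoded;
            try {
                decoded = URLDecoder.decode(m.group(), "UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException(e); // UTF-8 is always available
            }
            // quoteReplacement keeps '$' and '\' in the decoded text literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // %24 decodes to '$', which would break appendReplacement without quoting.
        System.out.println(decodeMatches("price=%2410&op=1%2B1"));
    }
}
```

The quoteReplacement call matters here because decoded text can contain `$` or `\`, which appendReplacement would otherwise interpret as group references or escapes.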

Thursday, December 15, 2011

Tuesday, December 13, 2011

[Alfresco] Search API


Search API Example:
  var def =
  {
     query: "cm:name:test*",
     language: "fts-alfresco"
  };
  var results = search.query(def);

Parameters
search
{
   query: string,          mandatory, in appropriate format and encoded for the given language
   store: string,          optional, defaults to 'workspace://SpacesStore'
   language: string,       optional, one of: lucene, xpath, jcr-xpath, fts-alfresco - defaults to 'lucene'
   templates: [],          optional, Array of query language template objects (see below) - if supported by the language 
   sort: [],               optional, Array of sort column objects (see below) - if supported by the language
   page: object,           optional, paging information object (see below) - if supported by the language
   namespace: string,      optional, the default namespace for properties
   defaultField: string,   optional, the default field for query elements when not explicit in the query
   onerror: string         optional, result on error - one of: exception, no-results - defaults to 'exception'
}

sort
{
   column: string,         mandatory, sort column in appropriate format for the language
   ascending: boolean      optional, defaults to false
}

page
{
   maxItems: int,          optional, max number of items to return in result set
   skipCount: int          optional, number of items to skip over before returning results
}

template
{
   field: string,          mandatory, custom field name for the template
   template: string        mandatory, query template replacement for the template
}

Reference:
Full Text Search Query Syntax
Alfresco.util.DataTable and search maxItems 

What is the maximum length of a URL?

http://www.boutell.com/newfaq/misc/urllength.html

[Java] DataOutputStream writeBytes(String s) encoding problem

Java encoding notes:

Java's DataOutputStream.writeBytes(String s) method writes Chinese text incorrectly. Its implementation is:


public final void writeBytes(String s) throws IOException {
    int len = s.length();
    for (int i = 0 ; i < len ; i++) {
        out.write((byte)s.charAt(i));
    }
    incCount(len);
}


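For example, with the string "你好" as the argument, the expression (byte)s.charAt(i) causes the problem.
A Java char is 16 bits wide, so one char can hold one Chinese character, but casting it to byte discards the high 8 bits,
and the Chinese character can no longer be written to the output stream intact.
Wherever Chinese characters may be written, it is better to convert the string to a byte array first and then write it with write(byte[] b). For example: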

String s = "你好";
write(s.getBytes());

Note: when getBytes is called without a charset, it uses the platform default encoding.

Update 2012/02/09:
When using DataOutputStream to simulate a form POST upload, switching to the write method is what solved the Chinese encoding problem.

  /*
  --boundary\r\n
  Content-Disposition: form-data; name=""; filename=""\r\n
  Content-Type: \r\n
  \r\n
  \r\n
  */
  // writeBytes is only safe for ASCII text. Values that may contain Chinese
  // characters (fieldName, fileName) go through write(byte[]) instead.
  // UTF-8 is used here; use whichever encoding the server expects.
  Charset utf8 = Charset.forName("UTF-8");
  this.dataOutputStream.writeBytes(this.PREFIX);
  this.dataOutputStream.writeBytes(this.boundary);
  this.dataOutputStream.writeBytes(this.CRLF);
  this.dataOutputStream.writeBytes("Content-Disposition: form-data; name=\"");
  this.dataOutputStream.write(fieldName.getBytes(utf8));
  this.dataOutputStream.writeBytes("\"; filename=\"");
  this.dataOutputStream.write(fileName.getBytes(utf8));
  this.dataOutputStream.writeBytes("\"");
  this.dataOutputStream.writeBytes(this.CRLF);
  if (mimeType != null) {
    this.dataOutputStream.writeBytes("Content-Type: ");
    this.dataOutputStream.writeBytes(mimeType);
    this.dataOutputStream.writeBytes(this.CRLF);
  }
  // Blank line that ends the part headers, written whether or not a Content-Type was given.
  this.dataOutputStream.writeBytes(this.CRLF);



Update 2012/03/12:
Today, in a different case, Chinese text still came out garbled, so it is better to specify the encoding you want when calling getBytes:
getBytes(Charset.forName("utf-8"))
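
The difference between the two calls can be checked without a network connection. The following is a small sketch that writes the same string into an in-memory buffer both ways; the class WriteBytesDemo and its method names are made up for illustration, and UTF-8 is an assumed target encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;

class WriteBytesDemo {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // writeBytes keeps only the low 8 bits of each char.
    static byte[] viaWriteBytes(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeBytes(s);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return buffer.toByteArray();
    }

    // Encoding the string to bytes first keeps every character intact.
    static byte[] viaWrite(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.write(s.getBytes(UTF8));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String s = "\u4F60\u597D"; // the two-character string used in the example above
        System.out.println("writeBytes wrote " + viaWriteBytes(s).length + " bytes");
        System.out.println("write wrote " + viaWrite(s).length + " bytes");
        System.out.println("round trip ok: " + new String(viaWrite(s), UTF8).equals(s));
    }
}
```

With writeBytes each 16-bit char collapses to a single byte, so the original text cannot be recovered on the receiving side; the write(byte[]) path carries the full UTF-8 encoding.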

Reference:
java 的DataOutputStream 的 writeBytes(String s) 方法在向
java String.getBytes()的問題

Monday, December 12, 2011

[Java] InputStreamReader

An InputStream is a binary stream, so there is no encoding. 
When you create the Reader, you need to know what character encoding to use, and that would depend on what the program you called produces (Java will not convert it in any way).


If you do not specify anything for InputStreamReader, it will use the platform default encoding, which may not be appropriate. There is another constructor that allows you to specify the encoding.


If you know what encoding to use (and you really have to know):

new InputStreamReader(process.getInputStream(), "UTF-8") // for example
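
To see what choosing the wrong encoding does, the same bytes can be read twice with different charsets. The following is a minimal sketch that uses an in-memory stream in place of process.getInputStream(); the class ReaderCharsetDemo and the method readAll are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

class ReaderCharsetDemo {
    // Reads the whole stream as text, decoding it with the named charset.
    static String readAll(InputStream in, String charsetName) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, Charset.forName(charsetName))) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes a child process would produce in UTF-8.
        byte[] bytes = "\u4F60\u597D".getBytes(Charset.forName("UTF-8"));

        String right = readAll(new ByteArrayInputStream(bytes), "UTF-8");
        String wrong = readAll(new ByteArrayInputStream(bytes), "ISO-8859-1");

        System.out.println("UTF-8 length: " + right.length());
        System.out.println("ISO-8859-1 length: " + wrong.length());
    }
}
```

Decoding with the matching charset gives back the two original characters, while ISO-8859-1 turns each of the six bytes into its own character.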

[Java] Setting the verbose option to show detailed information about a Java application

How to use the verbose option while running a Java application
The -verbose option prints detailed information while a Java application runs.
It accepts the following forms:
-verbose:class prints information about each class loaded.
-verbose:gc reports each garbage collection event.
-verbose:jni reports information about the native methods used by the application.


Setup steps
Step 1:
$ vi cataout.sh

Step 2: add the verbose option to JAVA_OPTS

JAVA_OPTS='-server -Xms1024m -Xmx1024m -XX:MaxNewSize=256m -XX:MaxPermSize=256m -XX:+CMSParallelRemarkEnabled -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing -XX:+UseConcMarkSweepGC -verbose'


Wednesday, December 07, 2011

[Lucene] Stopwords

Lucene does not support searching for the following stopwords (the analyzer drops them from the query):

"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "such",
"that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"

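One way to deal with the stopword list above is to check the user's input before sending it to the search, so the UI can tell the user when nothing searchable is left. The following is a rough sketch under that assumption; the class StopwordFilter and the method searchableTerms are made up for illustration and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopwordFilter {
    // The stopwords listed in this post.
    static final Set<String> STOPWORDS = new HashSet<String>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    // Splits the input on whitespace and keeps only the terms that are not stopwords.
    static List<String> searchableTerms(String userInput) {
        List<String> kept = new ArrayList<String>();
        for (String term : userInput.trim().split("\\s+")) {
            if (!term.isEmpty() && !STOPWORDS.contains(term.toLowerCase(Locale.ROOT))) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(searchableTerms("The report is in the archive"));
        System.out.println(searchableTerms("to be or not to be").isEmpty());
    }
}
```

An empty result means the whole input consisted of stopwords, which is the case where a search would come back with nothing.
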
Regex special characters


The patterns used in RegExp can be very simple, or very complicated, depending on what you're trying to accomplish. To match a simple string like "Hello World!" is no harder than actually writing the string, but if you want to match an e-mail address or html tag, you might end up with a very complicated pattern that will use most of the syntax presented in the table below.
Each entry gives the category, the pattern, and then the description.

Escaping
Pattern: \
Escapes special characters to literal and literal characters to special.

E.g: /\(s\)/ matches '(s)' while /(\s)/ matches any whitespace and captures the match.

Quantifiers
Pattern: {n}  {n,}  {n,m}  *  +  ?
Quantifiers match the preceding subpattern a certain number of times. The subpattern can be a single character, an escape sequence, a pattern enclosed by parentheses or a character set.

{n} matches exactly n times.
{n,} matches n or more times.
{n,m} matches n to m times.
* is short for {0,}. Matches zero or more times.
+ is short for {1,}. Matches one or more times.
? is short for {0,1}. Matches zero or one time.

E.g: /o{1,3}/ matches 'oo' in "tooth" and 'o' in "nose".

Pattern delimiters
Pattern: (pattern)  (?:pattern)
Matches entire contained pattern.

(pattern) captures match.
(?:pattern) doesn't capture match.

E.g: /(d).\1/ matches 'dad' in "abcdadef" and captures 'd', while /(?:.d){2}/ matches 'cdad' but captures nothing.

Note: (?:pattern) is a JavaScript 1.5 feature.

Lookaheads
Pattern: (?=pattern)  (?!pattern)
A lookahead matches only if the preceding subexpression is followed by the pattern, but the pattern is not part of the match. The subexpression is the part of the regular expression which will be matched.

(?=pattern) matches only if there is a following pattern in input.
(?!pattern) matches only if there is not a following pattern in input.

E.g: /Win(?=98)/ matches 'Win' only if 'Win' is followed by '98'.

Note: Lookahead is a JavaScript 1.5 feature.

Alternation
Pattern: |
Alternation matches content on either side of the alternation character.

E.g: /(a|b)a/ matches 'aa' in "dseaas" and 'ba' in "acbab".

Character sets
Pattern: [characters]  [^characters]
Matches any of the contained characters. A range of characters may be defined by using a hyphen.

[characters] matches any of the contained characters.
[^characters] negates the character set and matches all but the contained characters.

E.g: /[abcd]/ matches any of the characters 'a', 'b', 'c', 'd' and may be abbreviated to /[a-d]/. Ranges must be in ascending order, otherwise they will throw an error. (E.g: /[d-a]/ will throw an error.)
/[^0-9]/ matches all characters but digits.

Note: Most special characters are automatically escaped to their literal meaning in character sets.

Special characters
Pattern: ^  $  .  ?  and all the pattern characters listed above in the table.
Special characters are characters that match something else than what they appear as.

^ matches beginning of input (or new line with m flag).
$ matches end of input (or end of line with m flag).
. matches any character except a newline.
? directly following a quantifier makes the quantifier non-greedy (makes it match minimum instead of maximum of the interval defined).

E.g: /(.)*?/ matches nothing or '' in all strings.

Note: Non-greedy matches are not supported in older browsers such as Netscape Navigator 4 or Microsoft Internet Explorer 5.0.

Literal characters
Pattern: all characters except those with special meaning.
Mapped directly to the corresponding character.

E.g: /a/ matches 'a' in "Any ancestor".

Backreferences
Pattern: \n
Backreferences are references to the same thing as a previously captured match. n is a positive nonzero integer telling the browser which captured match to refer to.

/(\S)\1(\1)+/g matches all occurrences of three equal non-whitespace characters following each other.
/<(\S+).*>(.*)<\/\1>/ matches any tag.

E.g: /<(\S+).*>(.*)<\/\1>/ matches '<b>text</b>' in "text<b>text</b>text".

Character Escapes
Pattern: \f  \r  \n  \t  \v  \0  [\b]  \s  \S  \w  \W  \d  \D  \b  \B  \cX  \xhh  \uhhhh
\f matches form-feed.
\r matches carriage return.
\n matches linefeed.
\t matches horizontal tab.
\v matches vertical tab.
\0 matches NUL character.
[\b] matches backspace.
\s matches whitespace (short for [ \f\n\r\t\v\u00A0\u2028\u2029]).
\S matches anything but a whitespace (short for [^ \f\n\r\t\v\u00A0\u2028\u2029]).
\w matches any alphanumerical character (word characters) including underscore (short for [a-zA-Z0-9_]).
\W matches any non-word characters (short for [^a-zA-Z0-9_]).
\d matches any digit (short for [0-9]).
\D matches any non-digit (short for [^0-9]).
\b matches a word boundary (the position between a word character and a non-word character, such as a space).
\B matches a non-word boundary (any position where \b does not match).
\cX matches a control character. E.g: \cm matches control-M.
\xhh matches the character whose code is the two hexadecimal digits hh.
\uhhhh matches the Unicode character whose code is the four hexadecimal digits hhhh.
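
The table describes JavaScript regular expressions, but java.util.regex accepts the same syntax for the constructs used in the examples, so they can be checked from Java as well. The following is a small sketch, assuming only that each example is tried against the input string quoted in the table; the class RegexTableDemo and the helper firstMatch are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTableDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Quantifier: up to three consecutive 'o' characters.
        System.out.println(firstMatch("o{1,3}", "tooth"));
        // Capturing group with a backreference.
        System.out.println(firstMatch("(d).\\1", "abcdadef"));
        // Non-capturing group repeated twice.
        System.out.println(firstMatch("(?:.d){2}", "abcdadef"));
        // Lookahead: '98' must follow but is not part of the match.
        System.out.println(firstMatch("Win(?=98)", "Win98"));
        System.out.println(firstMatch("Win(?=98)", "WinXP"));
        // Alternation.
        System.out.println(firstMatch("(a|b)a", "acbab"));
        // Three or more equal non-whitespace characters in a row.
        System.out.println(firstMatch("(\\S)\\1(\\1)+", "a bbb c"));
    }
}
```

Note that every backslash in the pattern is doubled inside a Java string literal, which the JavaScript /.../ literal form does not need.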


Reference:
Regular Expressions patterns

[Alfresco] Lucene Search: Escaping special characters


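Today I tested viewing the trash can with an account whose name contains special characters, and the Alfresco web UI blew up as well.
Lesson learned: when the UID contains special characters, remember to escape it!
If you are using Lucene 1.4 or prior, there is no escape convenience utility. Instead, you must write your own. The characters that need to be escaped are: + - ! ( ) { } [ ] ^ " ~ * ? : \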
Lucene 1.4 Escaping (More Complete)
// Some constants. The character class covers every character in the list above,
// including [ and ", so that each of them gets a backslash in front.
private static final String LUCENE_ESCAPE_CHARS = "[\\\\+\\-\\!\\(\\)\\:\\^\\[\\]\\\"\\{\\}\\~\\*\\?]";
private static final Pattern LUCENE_PATTERN = Pattern.compile(LUCENE_ESCAPE_CHARS);
// $0 is the matched character; the replacement puts a literal backslash before it.
private static final String REPLACEMENT_STRING = "\\\\$0";
 
// ... Then, in your code somewhere...
String userInput = // ...
String escaped = LUCENE_PATTERN.matcher(userInput).replaceAll(REPLACEMENT_STRING);
Query query = QueryParser.parse(escaped);
// ...
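
To check what the escaping produces for a given UID without a running Lucene or Alfresco instance, the regex can be exercised on its own. The following is a standalone sketch of the same technique; the class LuceneEscaper and its escape method are made up for illustration and are not a Lucene API.

```java
import java.util.regex.Pattern;

class LuceneEscaper {
    // Character class for: \ + - ! ( ) { } [ ] ^ " ~ * ? :
    private static final Pattern SPECIAL =
        Pattern.compile("[\\\\+\\-!(){}\\[\\]^\"~*?:]");

    // Puts a backslash in front of every special character in the input.
    static String escape(String input) {
        return SPECIAL.matcher(input).replaceAll("\\\\$0");
    }

    public static void main(String[] args) {
        System.out.println(escape("admin(1)"));
        System.out.println(escape("user+test:01"));
        System.out.println(escape("plain_user"));
    }
}
```

Characters outside the list, such as the underscore in the last call, pass through unchanged.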


Reference:
Lucene: Escaping Special Characters

Monday, December 05, 2011

[Javascript] Characters to escape in JSON

$("#exec").click(function () {
  var newPW = $("#newPW").val();
  console.log("old newPW:" + newPW);
  var testPW = {
    test: newPW
  };
  // JSON.stringify escapes the special characters in the value
  console.log("JSON.stringify:" + JSON.stringify(testPW));
});
Reference:
Characters to escape in JSON

Wednesday, November 30, 2011

[Javascript] how to get local file path from fileinput using IE8 or IE9

http://stackoverflow.com/questions/5753442/how-can-i-get-the-local-filepath-from-the-fileinput-using-javascript-in-ie9



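For this to work, IE users need to enable two settings:

  • IE8 and IE9: the user must allow the use of ActiveXObject

Tools -> Internet Options -> Security -> Custom level -> ActiveX controls and plug-ins -> set "Initialize and script ActiveX controls not marked as safe for scripting" to Prompt (for security reasons, do not set it to Enable)

  • IE9: the user must also allow file uploads to access the local path (otherwise the file upload shows fakepath)
Tools -> Internet Options -> Security -> Custom level -> Miscellaneous -> set "Include local directory path when uploading files to a server" to Enable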

Saturday, November 26, 2011

Thursday, November 24, 2011

Special Characters Supported for Passwords

Name of the Character        Character
at sign                      @
percent sign                 %
plus sign                    +
backslash                    \
slash                        /
single quotation mark        '
exclamation point            !
number sign                  #
dollar sign                  $
caret                        ^
question mark                ?
colon                        :
comma                        .
left parenthesis             (
right parenthesis            )
left brace                   {
right brace                  }
left bracket                 [
right bracket                ]
tilde                        ~
grave accent                 This character is also known as the backquote character. The grave accent cannot be reproduced in this document.
hyphen                       -
underscore                   _


Sunday, November 20, 2011

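URL decode/encode: concepts

URL decode/encode comes up often when passing Chinese text between web pages.
The article "混亂的 URLEncode" explains why URLEncode is needed;
it is worth reading if you are interested.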

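[[Javascript troubleshooting notes]] Decimal arithmetic

To round a number to a given number of decimal places in JavaScript, use the built-in toFixed() function.

Example:
var num = new Number(13.3714);
document.write(num.toFixed() + "<br>");
document.write(num.toFixed(1) + "<br>");
document.write(num.toFixed(3) + "<br>");
document.write(num.toFixed(10) + "<br>");

Result: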
13
13.4
13.371
13.3714000000

Thursday, November 03, 2011

[Javascript] URL decode encode

JavaScript offers several ways to URL-encode a string; choose the one that fits your needs!

The results of encoding with JavaScript escape, encodeURI, and encodeURIComponent are summarized in the table below:
Text type                 English  Digits  Chinese    Unescaped characters  Reserved characters             Score
Original string           AZaz     01      堃         -_.!~*'()             ;,/?:@&=+$                      #
After escape              AZaz     01      %u5803     -_.%21%7E*%27%28%29   %3B%2C/%3F%3A@%26%3D+%24        %23
After encodeURI           AZaz     01      %E5%A0%83  -_.!~*'()             ;,/?:@&=+$                      #
After encodeURIComponent  AZaz     01      %E5%A0%83  -_.!~*'()             %3B%2C%2F%3F%3A%40%26%3D%2B%24  %2
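
When the encoded value is received by Java code, the counterpart calls are URLEncoder and URLDecoder. The following is a minimal sketch, assuming UTF-8 on both sides; the class UrlCodecDemo and its two helper methods are made up for illustration. Note that URLEncoder produces form encoding, so a space becomes "+" rather than "%20" as it would with encodeURIComponent.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class UrlCodecDemo {
    // Form-encodes the text as UTF-8.
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Decodes percent escapes (and '+') as UTF-8.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String chinese = "\u5803"; // the character that escape() renders as %u5803
        System.out.println(encode(chinese));
        System.out.println(decode("%E5%A0%83").equals(chinese));
        System.out.println(encode("a b&c"));
    }
}
```

The %uXXXX form produced by escape is not understood by URLDecoder, which is one practical reason to prefer encodeURI or encodeURIComponent on the JavaScript side.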
