Forex Seongnam: Low-Latency Trading System Architecture


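11 Best Practices for Low Latency Systems.

Google found that an extra 500 ms of latency dropped traffic by 20%, and Amazon found that an extra 100 ms of latency dropped sales by 1%. Ever since, developers have been racing to the bottom of the latency curve, with front-end developers squeezing every last millisecond out of their JavaScript, CSS, and even HTML. What follows is a random walk through a variety of best practices to keep in mind when designing low latency systems. Most of these suggestions are taken to their logical extreme, but of course tradeoffs can be made. (Thanks to an anonymous user for asking this question on Quora and getting me to put my thoughts down in writing.)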

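Choose the right language.

Scripting languages need not apply. Though they keep getting faster, when you are looking to shave those last few milliseconds off your processing time you cannot afford the overhead of an interpreted language. You will also want a strong memory model to enable lock-free programming, so you should be looking at Java, Scala, C++11 or Go.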

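Keep it all in memory.

I/O will kill your latency, so make sure all of your data is in memory. This generally means managing your own in-memory data structures and maintaining a persistent log, so you can rebuild the state after a machine or process restart. Some options for a persistent log include Bitcask, Krati, LevelDB and BDB-JE. Alternatively, you might be able to get away with running a local, persisted in-memory database like redis or MongoDB (with far more memory than data). Note that you can lose some data on a crash because these databases sync to disk in the background.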

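Keep data and processing colocated.

Network hops are faster than disk seeks, but they still add a lot of overhead. Ideally, your data should fit entirely in memory on one host. With AWS providing almost 1/4 TB of RAM in the cloud and physical servers offering multiple TBs, this is generally possible. If you need to run on more than one host, make sure that your data and requests are properly partitioned so that all the data needed to service a given request is available locally.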

Keep the system underutilized.

Low latency requires always having resources available to process the request. Don't try to run at the limit of what your hardware and software can provide. Always keep plenty of headroom for bursts.

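Keep context switches to a minimum.

Context switches are a sign that you are doing more compute work than you have resources for. You will want to limit your number of threads to the number of cores on your system and pin each thread to its own core.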

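Keep your reads sequential.

All forms of storage, whether rotational, flash based, or memory, perform significantly better when used sequentially. When you issue sequential reads to memory you trigger prefetching at the RAM level as well as at the CPU cache level. If done properly, the next piece of data you need will always be in L1 cache right before you need it. The easiest way to help this process along is to make heavy use of arrays of primitive data types or structs. Following pointers, whether through linked lists or through arrays of objects, should be avoided.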
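As a rough illustration of the two layouts contrasted above, the following sketch computes the same sum over a contiguous primitive array and over a linked list of boxed values. The class and method names are hypothetical and no timings are claimed; the point is only the shape of the memory access in each case.

```java
import java.util.LinkedList;
import java.util.List;

// Hypothetical example class; not taken from the original article.
final class SequentialReads {
    /** Builds a contiguous array holding 0, 1, ..., n - 1. */
    static long[] primitiveRange(int n) {
        long[] values = new long[n];
        for (int i = 0; i < n; i++) {
            values[i] = i;
        }
        return values;
    }

    /** Builds a linked list of boxed values holding 0, 1, ..., n - 1. */
    static List<Long> linkedRange(int n) {
        List<Long> values = new LinkedList<>();
        for (long i = 0; i < n; i++) {
            values.add(i);
        }
        return values;
    }

    /** Walks memory front to back: the pattern hardware prefetchers expect. */
    static long sum(long[] values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    /** Follows a pointer to each node and then to each boxed Long. */
    static long sum(List<Long> values) {
        long total = 0;
        for (Long v : values) {
            total += v;
        }
        return total;
    }
}
```

Both methods return the same result; the difference is that the array version reads adjacent memory while the list version may jump around the heap on every element.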

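Batch your writes.

This sounds counterintuitive, but you can gain significant performance improvements by batching writes. There is, however, a misconception that this means the system should wait an arbitrary amount of time before doing a write. Instead, one thread should spin in a tight loop doing I/O. Each write batches all the data that arrived since the last write was issued. This makes for a very fast and adaptive system.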
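A minimal sketch of that adaptive batching idea, assuming a hypothetical `BatchingWriter` class: producers enqueue records, and each pass of the writer takes everything that has arrived since the previous pass and writes it as one batch. The in-memory `writes` list stands in for a real I/O call, and in a real system `writeOnce` would be called in a loop by one dedicated thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical example class; the "write" is simulated by appending to a list.
final class BatchingWriter {
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();
    private final List<List<String>> writes = new ArrayList<>();

    /** Called by producers; never waits for the writer. */
    void submit(String record) {
        pending.add(record);
    }

    /**
     * One pass of the writer loop: drain whatever arrived since the last
     * pass and write it as a single batch. Returns the batch size.
     */
    int writeOnce() {
        List<String> batch = new ArrayList<>();
        for (String r = pending.poll(); r != null; r = pending.poll()) {
            batch.add(r);
        }
        if (!batch.isEmpty()) {
            writes.add(batch); // stand-in for one I/O call covering the whole batch
        }
        return batch.size();
    }

    List<List<String>> writes() {
        return writes;
    }
}
```

No timer is involved: under light load each batch holds a single record, and under heavy load batches grow on their own because more records arrive while the previous write is in progress.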

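Respect your cache.

With all of these optimizations in place, memory access quickly becomes the bottleneck. Pinning threads to their own cores helps reduce CPU cache pollution, and sequential I/O helps preload the cache. Beyond that, you should keep memory sizes down by using primitive data types so that more data fits in cache. You can also look into cache-oblivious algorithms, which work by recursively breaking the data down until it fits in cache and then doing any necessary processing.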

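Non-blocking as much as possible.

Make friends with non-blocking and wait-free data structures and algorithms. Every time you use a lock you have to go down the stack to the OS to mediate the lock, which is a huge overhead. Often, if you know what you are doing, you can get around locks by understanding the memory model of the JVM, C++11 or Go.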
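As one small illustration of working without a lock, the sketch below tracks a running maximum with a compare-and-set retry loop on `java.util.concurrent.atomic.AtomicLong`. The class name `LockFreeMax` is hypothetical and the example is deliberately simple; real lock-free structures need far more care.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example class showing a compare-and-set retry loop.
final class LockFreeMax {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    /** Records the candidate if it exceeds the current maximum; never blocks. */
    void offer(long candidate) {
        long current = max.get();
        // Retry only while our value is still larger and another thread
        // changed the maximum between our read and our compare-and-set.
        while (candidate > current && !max.compareAndSet(current, candidate)) {
            current = max.get();
        }
    }

    long get() {
        return max.get();
    }
}
```

A thread that loses the race simply re-reads and tries again, so no thread ever parks in the kernel waiting for another.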

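Async as much as possible.

Any processing, and particularly any I/O, that is not absolutely necessary for building the response should be done outside the critical path.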

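Parallelize as much as possible.

Any processing, and particularly any I/O, that can happen in parallel should be done in parallel. For instance, if your high availability strategy includes logging transactions to disk and sending transactions to a secondary server, those actions can happen in parallel.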
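The journaling-plus-replication case can be sketched with `CompletableFuture`. This is an assumption-laden illustration: the class `ParallelDurability` is hypothetical, and the two in-memory lists stand in for a disk journal and a secondary server.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical example class; both "sinks" are in-memory stand-ins.
final class ParallelDurability {
    final List<String> journal = new CopyOnWriteArrayList<>();
    final List<String> replica = new CopyOnWriteArrayList<>();

    /**
     * Starts journaling and replication at the same time and returns a
     * future that completes only when both have finished.
     */
    CompletableFuture<Void> persist(String transaction) {
        CompletableFuture<Void> logged =
                CompletableFuture.runAsync(() -> journal.add(transaction));
        CompletableFuture<Void> replicated =
                CompletableFuture.runAsync(() -> replica.add(transaction));
        return CompletableFuture.allOf(logged, replicated);
    }
}
```

The caller waits for the slower of the two actions rather than for their sum.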

Almost all of this comes from following what LMAX is doing with their Disruptor project. Read up on that and follow anything that Martin Thompson does.


Published by Benjamin Darfler.

Comments on "11 Best Practices for Low Latency Systems"

And happy to be on your list 🙂

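Good article. One beef: Go doesn't have a sophisticated memory model like Java or C++11. If your system fits the goroutine and channel architecture, you're in luck; otherwise you're not. AFAIK it's not possible to opt out of the runtime scheduler, so there are no native OS threads, and the ability to build your own lock-free data structures (such as SPSC queues and ring buffers) is also severely lacking.

Thanks for the reply. While the Go memory model (golang.org/ref/mem) might not be as robust as Java's or C++11's, I was under the impression that you could still build lock-free data structures with it. For instance github/textnode/gringo, github/scryner/lfreequeue and github/mocchira/golfhash. Perhaps I'm missing something? Admittedly I know far less about Go than about the JVM.

Benjamin, the Go memory model detailed at golang.org/ref/mem is mostly in terms of channels and mutexes. I looked through the packages you listed, and while the data structures there are "lock free", they are not equivalent to what one might build in Java or C++11. The sync package currently has no support for relaxed atomics or for the acquire/release semantics of C++11. Without that support it's difficult to build SPSC data structures as efficient as the ones possible in C++ or Java. The projects you link use atomic.Add…, which is a sequentially consistent atomic. It is built with XADD; see github/tonnerre/golang/blob/master/src/pkg/sync/atomic/asm_amd64.s.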

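I'm not trying to knock Go down. It takes minimal effort to write async IO and concurrent code that is fast enough for most people. The std library is highly tuned for performance too. Golang also supports structs, which are missing in Java. But as it stands, I think the simplistic memory model and the goroutine runtime stand in the way of building the kind of systems you are talking about.

Thanks for the in-depth reply. I hope people find this back and forth useful.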

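A 'native' language would probably be better, but it isn't strictly required. Facebook showed us it can be done in PHP. Granted, they use pre-compiled PHP with their HHVM machine. But it is possible!

Unfortunately PHP still lacks an acceptable memory model, even if HHVM significantly improves the execution speed.

While I'll fight to use higher-level languages as much as the next guy, I think the only way to achieve the low-latency apps people are looking for is to drop down to a language like C. It seems that the harder a language is to write, the faster it executes.

I would highly recommend you look into the work being done in the projects and blogs I linked to. The JVM is quickly becoming the hot spot for these types of systems because it provides a strong memory model and garbage collection, which together enable lock-free programming that is nearly or completely impossible with a weak or undefined memory model and reference counters for memory management.

I'll take a look, Benjamin. Thanks for pointing them out.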

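Garbage collection for lock-free programming is a bit of a deus ex machina. MPMC and SPSC queues can both be built without needing a GC. There are also plenty of ways of doing lock-free programming without garbage collection, and reference counting is not the only one. Hazard pointers, RCU, proxy collectors and so on all provide support for deferred reclamation, and they are usually coded in support of one algorithm (not generic), so they are usually much easier to build. Of course the trade-off is that production-quality GCs have a lot of work put into them and will help the less experienced programmer write lock-free algorithms without coding up deferred reclamation schemes. Some links on work done in this field: cs.toronto.edu/

Yes, C/C++ only recently gained a memory model, but that doesn't mean they were completely unsuitable for lock-free code before. GCC and other high-quality compilers have had compiler-specific directives for lock-free programming on supported platforms for a really long time; it just wasn't standardized in the language. Linux and other platforms have provided these primitives for some time too. Java's unique position has been that it provides a formalized memory model that is guaranteed to work on all supported platforms. In principle this is awesome, but most server-side developers work on one platform (Linux/Windows). They already had the tools to build lock-free code for their platform.

GC is a great tool but not a necessary one. It has a cost both in performance and in complexity (all the tricks needed to avoid STW GC). C++11/C11 already support proper memory models. Let's not forget that JVMs have no obligation to support the Unsafe API in the future. Also, Unsafe code is "unsafe", so you lose the benefits of Java's safety features. Finally, IMO the Unsafe code used to lay out memory and simulate structs in Java looks way uglier than C/C++ structs, where the compiler does that work for you in a reliable manner. C and C++ also give access to all the low-level, platform-specific power tools such as the PAUSE instruction, SSE/AVX/NEON and so on. You can even tune your code layout through linker scripts! The power provided by the C/C++ tool chain is really unmatched by the JVM. Java is a great platform nonetheless, but I think its biggest advantage is that ordinary business logic (90% of your code) can still depend on GC and the safety features while making use of highly tuned and tested libraries written with Unsafe. This is a trade-off between getting the last 5% of performance and being productive. It is a trade-off that makes sense for a lot of people, but a trade-off nonetheless. Writing complicated application code in C/C++ is a nightmare, after all.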


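The missing 12th: don't use garbage-collected languages. GC is a bottleneck in the worst case. It is likely to halt all threads. It's global. It distracts the architect from managing one of the most critical resources (memory close to the CPU) himself.

Actually a lot of this work comes straight out of Java. To do lock-free programming right you need a clear memory model, which C++ only recently gained. If you know how to work with the GC rather than against it, you can create low latency systems much more easily.

I have to agree with Ben here. There has been a lot of progress in GC parallelism over the last decade or so, with the G1 collector being the latest incarnation. It may take a little time to tune the heap and the various knobs so that the GC almost never pauses, but that pales in comparison to the developer time it takes to go without a GC.

You can even go one step further and create systems that produce so little garbage that you can easily push your GC outside of your operating window. This is how all the high frequency trading shops do it when running on the JVM.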


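> Don't use garbage collected languages.

Or, at least, "traditional" garbage-collected languages. They do differ: Erlang also has a collector, but it doesn't create a bottleneck because it doesn't "stop the world" the way Java does while collecting garbage. Instead it stops individual small "micro-threads" on a microsecond scale, so it's not noticeable in the big picture.

Rewrite that as "traditional" garbage collection algorithms. At LMAX we use Azul Zing, and just by using a different JVM with a different approach to garbage collection we've seen huge improvements in performance, because both major and minor GCs are orders of magnitude cheaper.

There are other costs that offset that, of course: you use a hell of a lot more heap, and Zing isn't cheap.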

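Reblogged this on Java Prorgram Examples and commented:

One of the must-read articles for Java programmers. In ten minutes it gives you the lessons you would otherwise learn only after spending considerable time tuning and developing low latency systems in Java.

Reviving an old thread, but (amazingly) this has to be pointed out:

1) Higher-level languages (like Java) don't elicit functionality from the hardware that is unavailable to lower-level languages (like C). To state that something is 'completely impossible' in C while readily doable in Java is complete rubbish unless you acknowledge that Java runs on virtual hardware, where the JVM must synthesize the functionality that Java requires but the physical hardware does not provide. If a JVM (e.g. one written in C) can synthesize function X, then so can a C programmer.

2) "Lock free" isn't what people think, except almost by coincidence in certain circumstances such as single-core x86. Multicore x86 cannot run lock free without memory barriers, which have complexity and cost similar to regular locking. As per 1 above, if lock free works in a given environment, it is because it is supported by the hardware, or emulated/synthesized by software in a virtual environment.

Great points, Julius. The point I was trying (perhaps unsuccessfully) to make is that it is prohibitively difficult to apply many of these patterns in C because they rely on GC. This goes beyond simply using memory barriers. You have to consider freeing memory as well, which gets particularly difficult when you are dealing with lock-free and wait-free algorithms. This is where GC is a huge win. That said, I hear Rust has some very interesting ideas around memory ownership that might begin to address some of these issues.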

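The LMAX Architecture.

LMAX is a new retail financial trading platform. As a result it has to process many trades with low latency. The system is built on the JVM platform and centers on a Business Logic Processor that can handle 6 million orders per second on a single thread. The Business Logic Processor runs entirely in-memory using event sourcing. It is surrounded by Disruptors, a concurrency component that implements a network of queues that operate without needing locks. During the design process the team concluded that recent directions in high-performance concurrency models using queues are fundamentally at odds with modern CPU design.

Over the last few years we keep hearing that "the free lunch is over" [1] - we can't expect increases in individual CPU speed. So to write fast code we need to explicitly use multiple processors with concurrent software. This is not good news: writing concurrent code is very hard. Locks and semaphores are hard to reason about and hard to test, which means we spend more time worrying about satisfying the computer than solving the domain problem. Various concurrency models, such as Actors and Software Transactional Memory, aim to make this easier, but there is still a burden that introduces bugs and complexity.

So I was fascinated to hear about a talk at QCon London in March last year from LMAX. LMAX is a new retail financial trading platform. Its business innovation is that it is a retail platform, allowing anyone to trade in a range of financial derivative products [2]. A trading platform like this needs very low latency: trades have to be processed quickly because the market is moving rapidly. A retail platform adds complexity because it has to do this for lots of people. So the result is more users, with lots of trades, all of which need to be processed quickly. [3]

Given the shift to multi-core thinking, this kind of demanding performance would naturally suggest an explicitly concurrent programming model, and indeed this was their starting point. But the thing that got people's attention at QCon was that this wasn't where they ended up. In fact they ended up doing all the business logic for their platform (all trades, from all customers, in all markets) on a single thread. A thread that will process 6 million orders per second using commodity hardware. [4]

Processing lots of transactions with low latency and none of the complexities of concurrent code - how can I resist digging into that? Fortunately another difference between LMAX and other financial companies is that they are quite happy to talk about their technological decisions. So now that LMAX has been in production for a while, it's time to explore their fascinating design.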

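Overall structure.

Figure 1: LMAX's architecture in three blobs.

At the top level, the architecture has three parts:

- business logic processor [5]
- input disruptor
- output disruptors

As its name implies, the business logic processor handles all the business logic in the application. As I indicated above, it does this as a single-threaded Java program that reacts to method calls and produces output events. Consequently it's a simple Java program that doesn't require any platform frameworks to run other than the JVM itself, which allows it to be run easily in test environments.

Although the Business Logic Processor can run in a simple environment for testing, there is rather more involved choreography to get it running in a production setting. Input messages need to be taken off a network gateway and unmarshaled, replicated and journaled. Output messages need to be marshaled for the network. These tasks are handled by the input and output disruptors. Unlike the Business Logic Processor, these are concurrent components, since they involve IO operations that are both slow and independent. They were designed and built especially for LMAX, but they (like the overall architecture) are applicable elsewhere.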

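Business Logic Processor.

Keeping it all in memory.

The Business Logic Processor takes input messages sequentially (in the form of a method invocation), runs business logic on them, and emits output events. It operates entirely in-memory; there is no database or other persistent store. Keeping all data in-memory has two important benefits. Firstly it's fast: there's no database to provide slow IO to access, nor is there any transactional behavior to execute, since all the processing is done sequentially. The second advantage is that it simplifies programming: there's no object/relational mapping to do. All the code can be written using Java's object model without having to make any compromises for the mapping to a database.

Using an in-memory structure has an important consequence: what happens if everything crashes? Even the most resilient systems are vulnerable to someone pulling the power. The heart of dealing with this is Event Sourcing, which means that the current state of the Business Logic Processor is entirely derivable by processing the input events. As long as the input event stream is kept in a durable store (which is one of the jobs of the input disruptor), you can always recreate the current state of the business logic engine by replaying the events.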

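A good way to understand this is to think of a version control system. Version control systems are a sequence of commits; at any time you can build a working copy by applying those commits. VCSs are more complicated than the Business Logic Processor because they must support branching, while the Business Logic Processor is a simple sequence.

So, in theory, you can always rebuild the state of the Business Logic Processor by reprocessing all the events. In practice, however, that would take too long should you need to spin one up. So, just as with version control systems, LMAX can make snapshots of the Business Logic Processor state and restore from the snapshots. They take a snapshot every night during periods of low activity. Restarting the Business Logic Processor is fast: a full restart, including restarting the JVM, loading a recent snapshot, and replaying a day's worth of journals, takes less than a minute.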
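The replay-and-snapshot mechanism described above can be sketched in a few lines. Everything here is hypothetical and far simpler than a trading engine: the state is a map of account balances, the only event type is `Adjust`, and a snapshot is just a copy of the map. It is a sketch of the idea under those assumptions, not LMAX's implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class: state is derived purely from input events.
final class EventSourcedLedger {
    /** Hypothetical input event: change an account balance by some amount. */
    static final class Adjust {
        final String account;
        final long amount;

        Adjust(String account, long amount) {
            this.account = account;
            this.amount = amount;
        }
    }

    private final Map<String, Long> balances;

    private EventSourcedLedger(Map<String, Long> balances) {
        this.balances = balances;
    }

    /** Applies one event; processing is strictly sequential. */
    void apply(Adjust event) {
        balances.merge(event.account, event.amount, Long::sum);
    }

    long balance(String account) {
        return balances.getOrDefault(account, 0L);
    }

    /** Captures the current state so later restarts can skip older events. */
    Map<String, Long> snapshot() {
        return new HashMap<>(balances);
    }

    /** Rebuilds state from a snapshot plus the events journaled after it. */
    static EventSourcedLedger restore(Map<String, Long> snapshot, List<Adjust> journalTail) {
        EventSourcedLedger ledger = new EventSourcedLedger(new HashMap<>(snapshot));
        for (Adjust event : journalTail) {
            ledger.apply(event);
        }
        return ledger;
    }
}
```

Restoring from an empty snapshot and the full journal gives the same state as restoring from a later snapshot and only the events that followed it, which is what makes the nightly snapshot a pure optimization.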

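Snapshots make starting up a new Business Logic Processor faster, but not quickly enough should a Business Logic Processor crash at 2pm. As a result LMAX keeps multiple Business Logic Processors running all the time [6]. Each input event is processed by multiple processors, but all but one processor has its output ignored. Should the live processor fail, the system switches to another one. This ability to handle fail-over is another benefit of using Event Sourcing.

By event sourcing into replicas they can switch between processors in a matter of microseconds. As well as taking snapshots every night, they also restart the Business Logic Processors every night. The replication allows them to do this with no downtime, so they continue to process trades 24/7.

For more background on Event Sourcing, see the draft pattern on my site from a few years ago. The article is more focused on handling temporal relationships than on the benefits that LMAX uses, but it does explain the core idea.

Event Sourcing is valuable because it allows the processor to run entirely in-memory, but it also has a considerable advantage for diagnostics. If some unexpected behavior occurs, the team copies the sequence of events to their development environment and replays them there. This allows them to examine what happened much more easily than is possible in most environments.

This diagnostic capability extends to business diagnostics. There are some business tasks, such as in risk management, that require significant computation that isn't needed for processing orders. An example is getting a list of the top 20 customers by risk profile based on their current trading positions. The team handles this by spinning up a replicate domain model and carrying out the computation there, where it won't interfere with the core order processing. These analysis domain models can have variant data models, keep different data sets in memory, and run on different machines.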

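Tuning performance.

So far I've explained that the key to the speed of the Business Logic Processor is doing everything sequentially, in-memory. Just doing this allows developers to write code that can process 10K TPS [7]. They then found that concentrating on the simple elements of good code could bring this up into the 100K TPS range. This just needs well-factored code and small methods; essentially this allows Hotspot to do a better job of optimizing, and CPUs to cache the code more efficiently as it runs.

It took a bit more cleverness to go up another order of magnitude. There are several things that the LMAX team found helpful to get there. One was to write custom implementations of the Java collections that were designed to be cache-friendly and careful with garbage [8]. An example of this is using primitive Java longs as hashmap keys with a specially written array-backed Map implementation (LongToObjectHashMap). In general they've found that the choice of data structures often makes a big difference. Most programmers just grab whatever List they used last time rather than thinking about which implementation is the right one for this context. [9]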
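To give a feel for what an array-backed map keyed by primitive longs might look like, here is a deliberately small sketch. It is not LMAX's LongToObjectHashMap, whose source is not shown in this article; the class `LongKeyMap`, its fixed capacity, and its linear-probing scheme are all assumptions made for illustration. It omits removal and resizing.

```java
// Hypothetical example class: open addressing over two parallel arrays,
// so keys are never boxed and no per-entry node objects are allocated.
final class LongKeyMap<V> {
    private final long[] keys;
    private final Object[] values; // a null value marks an empty slot
    private final int mask;
    private int size;

    LongKeyMap(int capacityPowerOfTwo) {
        if (capacityPowerOfTwo <= 0 || Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        keys = new long[capacityPowerOfTwo];
        values = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    /** Finds the slot holding the key, or the empty slot where it would go. */
    private int slotFor(long key) {
        int i = (int) (key ^ (key >>> 32)) & mask;
        while (values[i] != null && keys[i] != key) {
            i = (i + 1) & mask; // linear probing keeps the search in adjacent memory
        }
        return i;
    }

    void put(long key, V value) {
        if (value == null) {
            throw new NullPointerException("null values are not supported");
        }
        int i = slotFor(key);
        if (values[i] == null) {
            // Always leave one slot empty so that probing terminates.
            if (size == mask) {
                throw new IllegalStateException("map is full");
            }
            size++;
        }
        keys[i] = key;
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    V get(long key) {
        return (V) values[slotFor(key)];
    }
}
```

Compared with `HashMap<Long, V>`, lookups here allocate nothing and touch two flat arrays, which is the kind of cache-friendly, garbage-aware behavior the paragraph above describes.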

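Another technique for reaching that top level of performance is paying attention to performance testing. I've long noticed that people talk a lot about techniques to improve performance, but the one thing that really makes a difference is to test it. Even good programmers are very good at constructing performance arguments that end up being wrong, so the best programmers prefer profilers and test cases to speculation. [10] The LMAX team has also found that writing tests first is a very effective discipline for performance tests.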

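Programming model.

This style of processing does introduce some constraints into the way you write and organize the business logic. The first of these is that you have to tease out any interaction with external services. An external service call is going to be slow, and with a single thread it will halt the entire order processing machine. As a result you can't make calls to external services within the business logic. Instead you need to finish that interaction with an output event, and wait for another input event to pick it back up again.

I'll use a simple non-LMAX example to illustrate. Imagine you are making an order for jelly beans by credit card. A simple retailing system would take your order information, use a credit card validation service to check your credit card number, and then confirm your order, all within a single operation. The thread processing your order would block while waiting for the credit card to be checked, but that block wouldn't be very long for the user, and the server can always run another thread on the processor while it's waiting.

In the LMAX architecture, you would split this operation into two. The first operation would capture the order information and finish by outputting an event (credit card validation requested) to the credit card company. The Business Logic Processor would then carry on processing events for other customers until it received a credit-card-validated event in its input event stream. On processing that event it would carry out the confirmation tasks for that order.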
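The jelly bean example might be sketched like this. The class `OrderProcessor`, its method names, and the string-encoded events are all hypothetical; the point is only that the first handler ends by emitting an output event and the second handler is triggered later by a separate input event, with no blocking call in between.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class: single-threaded, no calls to external services.
final class OrderProcessor {
    private final Map<Long, String> pendingOrders = new HashMap<>();
    final List<String> outputEvents = new ArrayList<>();

    /** Step 1: capture the order and emit a validation request, then return. */
    void onOrderPlaced(long orderId, String item, String cardNumber) {
        pendingOrders.put(orderId, item);
        outputEvents.add("VALIDATE_CARD " + orderId + " " + cardNumber);
    }

    /** Step 2: runs whenever the validation result arrives as an input event. */
    void onCardValidated(long orderId, boolean valid) {
        String item = pendingOrders.remove(orderId);
        if (item == null) {
            return; // unknown or already completed order: nothing to do
        }
        String outcome = valid ? "ORDER_CONFIRMED " : "ORDER_REJECTED ";
        outputEvents.add(outcome + orderId + " " + item);
    }
}
```

Between the two calls the processor is free to handle events for any number of other orders, which is the whole reason for splitting the operation.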

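Working in this kind of event-driven, asynchronous style is somewhat unusual, although using asynchrony to improve the responsiveness of an application is a familiar technique. It also helps the business process be more resilient, as you have to be more explicit in thinking about the different things that can happen with the remote application.

A second feature of the programming model lies in error handling. The traditional model of sessions and database transactions provides a helpful error handling capability. Should anything go wrong, it's easy to throw away everything that happened so far in the interaction. Session data is transient and can be discarded, at the cost of some irritation to the user if they were in the middle of something complicated. If an error occurs on the database side you can roll back the transaction.

LMAX's in-memory structures are persistent across input events, so if there is an error it's important not to leave that memory in an inconsistent state. However there's no automated rollback facility. As a consequence the LMAX team puts a lot of attention into ensuring the input events are fully valid before doing any mutation of the in-memory persistent state. They have found that testing is a key tool in flushing out these kinds of problems before going into production.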

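Input and output disruptors.

Although the business logic occurs in a single thread, there are a number of tasks to be done before we can invoke a business object method. The original input for processing comes off the wire in the form of a message, and this message needs to be unmarshaled into a form convenient for the Business Logic Processor to use. Event Sourcing relies on keeping a durable journal of all the input events, so each input message needs to be journaled onto a durable store. Finally the architecture relies on a cluster of Business Logic Processors, so we have to replicate the input messages across this cluster. Similarly, on the output side, the output events need to be marshaled for transmission over the network.

Figure 2: The activities done by the input disruptor (using UML activity diagram notation).

The replicator and journaler involve IO and therefore are relatively slow. After all, the central idea of the Business Logic Processor is that it avoids doing any IO. Also, these three tasks are relatively independent: all of them need to be done before the Business Logic Processor works on a message, but they can be done in any order. So unlike with the Business Logic Processor, where each trade changes the market for subsequent trades, there is a natural fit for concurrency.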

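To handle this concurrency the LMAX team developed a special concurrency component, which they call a Disruptor [11].

The LMAX team has released the source code for the Disruptor under an open source license.

At a crude level you can think of a Disruptor as a multicast graph of queues, where producers put objects on it that are sent to all the consumers for parallel consumption through separate downstream queues. When you look inside, you see that this network of queues is really a single data structure: a ring buffer. Each producer and consumer has a sequence counter to indicate which slot in the buffer it is currently working on. Each producer/consumer writes its own sequence counter but can read the others' sequence counters. This way the producer can read the consumers' counters to ensure the slot it wants to write to is available, without any locks on the counters. Similarly a consumer can ensure it only processes a message once another consumer is done with it by watching the counters.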
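The sequence-counter idea can be sketched with a much-reduced ring buffer for exactly one producer thread and one consumer thread. This is not the Disruptor's code or API: the class `SpscRing` and its methods are hypothetical, and it uses `AtomicLong` for the two counters so that each side's writes become visible to the other. Each counter has a single writer, and each side only reads the other's counter.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example class: one producer thread, one consumer thread.
final class SpscRing<T> {
    private final Object[] slots;
    private final int mask;
    private final AtomicLong published = new AtomicLong(-1); // written only by the producer
    private final AtomicLong consumed = new AtomicLong(-1);  // written only by the consumer

    SpscRing(int sizePowerOfTwo) {
        if (sizePowerOfTwo <= 0 || Integer.bitCount(sizePowerOfTwo) != 1) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        slots = new Object[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
    }

    /** Producer side: returns false if the consumer has not freed the slot yet. */
    boolean offer(T value) {
        long next = published.get() + 1;
        if (next - consumed.get() > slots.length) {
            return false; // writing now would overwrite an unread slot
        }
        slots[(int) (next & mask)] = value;
        published.set(next); // publish only after the slot has been written
        return true;
    }

    /** Consumer side: reads every published slot in one pass. */
    @SuppressWarnings("unchecked")
    int drainTo(List<T> out) {
        long available = published.get();
        int count = 0;
        for (long next = consumed.get() + 1; next <= available; next++) {
            out.add((T) slots[(int) (next & mask)]);
            count++;
        }
        consumed.set(available); // release the slots back to the producer
        return count;
    }
}
```

Because the consumer reads up to the producer's latest published sequence in a single pass, a consumer that has fallen behind processes the whole backlog as one batch instead of one message at a time.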

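Figure 3: The input disruptor coordinates one producer and four consumers.

Output disruptors are similar, but they only have two sequential consumers, for marshaling and output. [12] Output events are organized into several topics, so that messages can be sent only to the receivers who are interested in them. Each topic has its own disruptor.

The disruptors I've described are used in a style with one producer and multiple consumers, but this isn't a limitation of the disruptor's design. The disruptor can work with multiple producers too, and in this case it still doesn't need locks. [13]

A benefit of the disruptor design is that it makes it easier for consumers to catch up quickly if they run into a problem and fall behind. If the unmarshaler has a problem when processing slot 15 and returns when the receiver is on slot 31, it can read the data from slots 16-30 in one batch to catch up. This batch read of the data from the disruptor makes it easier for lagging consumers to catch up quickly, thus reducing overall latency.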

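I've described things here with one each of the journaler, replicator, and unmarshaler, and this is indeed what LMAX does. But the design would allow several of these components to run. If you ran two journalers, one would take the even slots and the other would take the odd slots. This allows further concurrency of these IO operations should it become necessary.

The ring buffers are large: 20 million slots for the input buffer and 4 million slots for each of the output buffers. The sequence counters are 64-bit long integers that increase monotonically even as the ring slots wrap. [14] The buffer is set to a size that's a power of two so the compiler can do an efficient modulus operation to map from the sequence counter number to the slot number. Like the rest of the system, the disruptors are bounced overnight. This bounce is mainly done to wipe memory so that there is less chance of an expensive garbage collection event during trading. (I also think it's a good habit to restart regularly, so that you rehearse how to do it for emergencies.)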
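The power-of-two trick is small enough to show directly. In this sketch (the class and method names are hypothetical) the slot index is computed with a bit mask, which gives the same result as a modulus for non-negative sequence numbers when the buffer size is a power of two.

```java
// Hypothetical helper showing the sequence-to-slot mapping.
final class SlotIndex {
    /** Maps an ever-increasing sequence number onto a slot in the ring. */
    static int slotFor(long sequence, int sizePowerOfTwo) {
        // For a power-of-two size, (size - 1) has all low bits set, so the
        // AND keeps exactly the bits that "sequence % size" would keep.
        return (int) (sequence & (sizePowerOfTwo - 1));
    }
}
```

With a buffer of 8 slots, sequences 0 to 7 map to slots 0 to 7, sequence 8 wraps to slot 0, sequence 9 to slot 1, and so on, while the sequence counter itself never resets.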

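The journaler's job is to store all the events in a durable form, so that they can be replayed should anything go wrong. LMAX does not use a database for this, just the file system. They stream the events onto the disk. In modern terms, mechanical disks are horribly slow for random access but very fast for streaming, hence the tag-line "disk is the new tape". [15]

Earlier on I mentioned that LMAX runs multiple copies of its system in a cluster to support rapid failover. The replicator keeps these nodes in sync. All communication in LMAX uses IP multicasting, so clients don't need to know which IP address is the master node. Only the master node listens directly to input events and runs a replicator. The replicator broadcasts the input events to the slave nodes. Should the master node go down, its lack of heartbeat will be noticed, and another node becomes master, starts processing input events, and starts its replicator. Each node has its own input disruptor, and thus has its own journal and does its own unmarshaling.

Even with IP multicasting, replication is still needed because IP messages can arrive in a different order on different nodes. The master node provides a deterministic sequence for the rest of the processing.

The unmarshaler turns the event data from the wire into a Java object that can be used to invoke behavior on the Business Logic Processor. Therefore, unlike the other consumers, it needs to modify the data in the ring buffer so it can store this unmarshaled object. The rule here is that consumers are permitted to write to the ring buffer, but each writable field can only have one parallel consumer that's allowed to write to it. This preserves the principle of having only a single writer. [16]

Figure 4: The LMAX architecture with the disruptors expanded.

The disruptor is a general purpose component that can be used outside of the LMAX system. Usually financial companies are very secretive about their systems, keeping quiet even about items that aren't germane to their business. Not only has LMAX been open about its overall architecture, it has also open-sourced the disruptor code. Not only will this allow other organizations to make use of the disruptor, it will also allow for more testing of its concurrency properties.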

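Queues and their lack of mechanical sympathy.

The LMAX architecture caught people's attention because it's a very different way of approaching a high performance system from what most people are thinking about. So far I've talked about how it works, but I haven't delved much into why it was developed this way. This tale is interesting in itself, because this architecture didn't just appear. It took a long time of trying more conventional alternatives, and realizing where they were flawed, before the team settled on this one.

Most business systems these days have a core architecture that relies on multiple active sessions coordinated through a transactional database. The LMAX team was familiar with this approach, and confident that it wouldn't work for LMAX. This assessment was founded in the experiences of Betfair, the parent company that set up LMAX. Betfair is a betting site that allows people to bet on sporting events. It handles very high volumes of traffic with a lot of contention, since sports bets tend to burst around particular events. To make this work they have one of the hottest database installations around, and they have had to perform many unnatural acts in order to make it work. Based on this experience they knew how difficult it was to maintain Betfair's performance, and they were sure that this kind of architecture would not work for the very low latency that a trading site would require. As a result they had to find a different approach.

Their initial approach was to follow what so many are saying these days: that to get high performance you need to use explicit concurrency. For this scenario, this means allowing orders to be processed by multiple threads in parallel. However, as is often the case with concurrency, the difficulty comes because these threads have to communicate with each other. Processing an order changes market conditions, and these conditions need to be communicated.

The approach they explored early on was the Actor model and its cousin SEDA. The Actor model relies on independent, active objects with their own thread that communicate with each other via queues. Many people find this kind of concurrency model much easier to deal with than trying to do something based on locking primitives.

The team built a prototype exchange using the actor model and ran performance tests on it. What they found was that the processors spent more time managing queues than doing the real logic of the application. Queue access was a bottleneck.

When pushing performance like this, it starts to become important to take account of the way modern hardware is constructed. The phrase Martin Thompson likes to use is "mechanical sympathy". The term comes from race car driving, and it reflects the driver having an innate feel for the car, so they are able to feel how to get the best out of it. Many programmers, and I confess I fall into this camp, don't have much mechanical sympathy for how programming interacts with hardware. What's worse is that many programmers think they have mechanical sympathy, but it's built on outdated notions of how hardware used to work.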

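One of the dominant factors in modern CPUs that affects latency is how the CPU interacts with memory. These days going to main memory is a very slow operation in CPU terms. CPUs have multiple levels of cache, each of which is significantly faster. So to increase speed you want to get your code and data into those caches.

At one level, the actor model helps here. You can think of an actor as its own object that clusters code and data, which is a natural unit for caching. But actors need to communicate, which they do through queues, and the LMAX team observed that it's the queues that interfere with caching.

The explanation runs like this: in order to put some data on a queue, you need to write to that queue. Similarly, to take data off the queue, you need to write to the queue to perform the removal. This is write contention: more than one client may need to write to the same data structure. To deal with the write contention a queue often uses locks. But if a lock is used, that can cause a context switch to the kernel. When this happens the processor involved is likely to lose the data in its caches.

The conclusion they came to was that to get the best caching behavior, you need a design that has only one core writing to any memory location [17]. Multiple readers are fine; processors often use special high-speed links between their caches. But queues fail the one-writer principle.

This analysis led the LMAX team to a couple of conclusions. Firstly it led to the design of the disruptor, which determinedly follows the single-writer constraint. Secondly it led to the idea of exploring the single-threaded business logic approach, asking how fast a single thread can go if it's freed of concurrency management.

The essence of working on a single thread is to ensure that you have one thread running on one core, that the caches warm up, and that as much memory access as possible goes to the caches rather than to main memory. This means that both the code and the working set of data need to be accessed as consistently as possible. Also, keeping small objects with code and data together allows them to be swapped between the caches as a unit, simplifying the cache management and again improving performance.

An essential part of the path to the LMAX architecture was the use of performance testing. The consideration and abandonment of an actor-based approach came from building and performance testing a prototype. Similarly, many of the steps in improving the performance of the various components were enabled by performance tests. Mechanical sympathy is very valuable - it helps to form hypotheses about what improvements you can make, and guides you to forward steps rather than backward ones - but in the end it's the testing that gives you the convincing evidence.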

Performance testing in this style, however, is not a well-understood topic. Regularly the LMAX team stresses that coming up with meaningful performance tests is often harder than developing the production code. Again mechanical sympathy is important to developing the right tests. Testing a low level concurrency component is meaningless unless you take into account the caching behavior of the CPU.

One particular lesson is the importance of writing tests against null components to ensure the performance test is fast enough to really measure what real components are doing. Writing fast test code is no easier than writing fast production code and it's too easy to get false results because the test isn't as fast as the component it's trying to measure.

Should you use this architecture?

At first glance, this architecture appears to be for a very small niche. After all the driver that led to it was to be able to run lots of complex transactions with very low latency - most applications don't need to run at 6 million TPS.

But the thing that fascinates me about this application is that they have ended up with a design which removes much of the programming complexity that plagues many software projects. The traditional model of concurrent sessions surrounding a transactional database isn't free of hassles. There's usually a non-trivial effort that goes into the relationship with the database. Object/relational mapping tools can ease much of the pain of dealing with a database, but they don't remove all of it. Most performance tuning of enterprise applications involves futzing around with SQL.

These days, you can get more main memory into your servers than us old guys could get as disk space. More and more applications are quite capable of putting all their working set in main memory - thus eliminating a source of both complexity and sluggishness. Event Sourcing provides a way to solve the durability problem for an in-memory system, and running everything in a single thread solves the concurrency issue. The LMAX experience suggests that as long as you need less than a few million TPS, you'll have enough performance headroom.

There is a considerable overlap here with the growing interest in CQRS. An event sourced, in-memory processor is a natural choice for the command-side of a CQRS system. (Although the LMAX team does not currently use CQRS.)

So what indicates you shouldn't go down this path? This is always a tricky question for little-known techniques like this, since the profession needs more time to explore its boundaries. A starting point, however, is to think of the characteristics that encourage the architecture.

One characteristic is that this is a connected domain where processing one transaction always has the potential to change how following ones are processed. With transactions that are more independent of each other, there's less need to coordinate, so using separate processors running in parallel becomes more attractive.

LMAX concentrates on figuring out the consequences of how events change the world. Many sites are more about taking an existing store of information and rendering various combinations of that information to as many eyeballs as they can find - e.g. think of any media site. Here the architectural challenge often centers on getting your caches right.

Another characteristic of LMAX is that this is a backend system, so it's reasonable to consider how applicable it would be for something acting in an interactive mode. Increasingly web applications are helping us get used to server systems that react to requests, an aspect that does fit in well with this architecture. Where this architecture goes further than most such systems is its absolute use of asynchronous communications, resulting in the changes to the programming model that I outlined earlier.

These changes will take some getting used to for most teams. Most people tend to think of programming in synchronous terms and are not used to dealing with asynchrony. Yet it's long been true that asynchronous communication is an essential tool for responsiveness. It will be interesting to see if the wider use of asynchronous communication in the JavaScript world, with AJAX and node.js, will encourage more people to investigate this style. The LMAX team found that while it took a bit of time to adjust to asynchronous style, it soon became natural and often easier. In particular error handling was much easier to deal with under this approach.

The LMAX team certainly feels that the days of the coordinating transactional database are numbered. The fact that you can write software more easily using this kind of architecture and that it runs more quickly removes much of the justification for the traditional central database.

For my part, I find this a very exciting story. Much of my goal is to concentrate on software that models complex domains. An architecture like this provides good separation of concerns, allowing people to focus on Domain-Driven Design and keeping much of the platform complexity well separated. The close coupling between domain objects and databases has always been an irritation - approaches like this suggest a way out.


1: The Free Lunch is Over.

This is the title of a famous essay by Herb Sutter. He describes the "free lunch" as the ever increasing clock speed of processors that regularly gave us more CPU performance every year. His point was that such clock cycle increases were no longer going to happen, instead performance increases would come in terms of multiple cores. But to take advantage of multiple cores, you need software that is capable of working concurrently - so without a shift in programming style people would no longer get the performance lunch for free.

2: I shall remain silent on what I think about the value of this innovation.

3: User Base.

All trading systems need low latency, since one trade can affect later trades and there's a lot of competition based on rapid reaction. Most trading platforms are for professionals - banks, brokers, etc - and typically have hundreds of users. A retail system has the potential for many more users: Betfair has millions of users and LMAX is designed for that scale. (The LMAX team isn't allowed to disclose its actual volumes.)

As it turns out, although a retail system has a lot of users, most of the activity comes from market makers. During volatile periods an instrument can get hundreds of updates per second, with unusual micro-bursts of hundreds of transactions within a single microsecond.

4: Hardware.

The 6 million TPS benchmark was measured on a 3GHz dual-socket quad-core Nehalem based Dell server with 32GB RAM.

5: The team does not use the name Business Logic Processor, in fact they have no name for that component, just referring to it as the business logic or core services. I've given it a name to make it easier to talk about in this article.

6: Currently LMAX runs two Business Logic Processors in its main data center and a third at a disaster recovery site. All three process input events.

7: What's in a transaction.

When people talk about transaction timing, one of the problems is what exactly is in a transaction. In some cases it's little more than inserting a new record in a database. LMAX's transactions are reasonably complex, more complex than a typical retail sale.

Placing an order in an exchange involves:

- checking the target market is open to take orders
- checking the order is valid for that market
- choosing the right matching policy for the type of order
- sequencing the order so that each order is matched at the best possible price and matched with the right liquidity
- creating and publicizing the trades made as a consequence of the match
- updating prices based on the new trades

8: At this scale of latency, you have to be aware of the garbage collector. For almost all systems these days, a modern GC compaction isn't going to have any noticeable effect on performance. However when you are trying to process millions of transactions per second with minimum jitter, a GC pause becomes a problem. The thing to remember is that short lived objects are ok, as they get collected quickly. So are objects that are permanent, since they will live for ever. The problematic objects are those that will get promoted to an older generation, but will eventually die. As this fragments the older generation region, it will trigger the compaction.

9: I rarely think about which collection implementation to use. This is perfectly reasonable when you're not in performance critical code. Different contexts suggest different behavior.

10: An interesting side-note. While the LMAX team shares much of the current interest in functional programming, they believe that the OO approach provides a better approach for this kind of problem. They've noticed that as they work to write faster code, they move away from a functional style towards OO style. Partly this because of the copying of data that functional styles require to maintain immutability. But it's also because objects provide a better model of a complex domain with a richer choice of data structures.

11: The name "disruptor" was inspired by a couple of sources. One is the fact that the LMAX team sees this component as something that disrupts current thinking on concurrency. The other is a response to the fact that Java is introducing a phaser, so it's natural to include disruptors too.

12: It would be possible to journal the output events too. This would have the advantage of not needing to recalculate them should they need to be replayed for downstream services. In practice, however, this isn't worthwhile. The business logic is deterministic and very fast, so there's no gain from storing the results.

13: Although it does need to use CAS instructions in this case. See the disruptor technical paper for more information.

14: This does mean that if they process a billion transactions per second the counter will wrap in 292 years, causing some hell to break loose. They have decided that fixing this is not a high priority.
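The 292-year figure can be checked with a few lines of arithmetic. The sketch below assumes the sequence counter is a signed 64-bit integer (as a Java long would be), which the note does not state explicitly; the constant and function names are made up for illustration.

```python
# Assumption: the sequence counter is a signed 64-bit integer (like a Java long).
MAX_SEQUENCE = 2**63 - 1
RATE_PER_SECOND = 1_000_000_000          # one billion transactions per second, as in the note
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_until_wrap(max_value: int, rate_per_second: int) -> float:
    """Return how many years pass before a counter incremented at this rate overflows."""
    return max_value / rate_per_second / SECONDS_PER_YEAR


years = years_until_wrap(MAX_SEQUENCE, RATE_PER_SECOND)
print(f"counter wraps after about {years:.0f} years")
```

At a more modest million transactions per second, the same counter would last a thousand times longer.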

15: SSDs are better at random access, but a disk-like IO system slows them down.

16: Another complication when writing fields is you have to ensure that any fields being written to are separated into different cache lines.

17: Ensuring a single writer to a memory location.

A complication in following the single-writer principle is that processors don't grab memory one location at a time. Rather they sweep up multiple contiguous locations, called a cache line, into cache in one go. Accessing memory in cache line chunks is obviously more efficient, but also means that you have to ensure you don't have locations within that cache line that are written by different cores. So, for example, the Disruptor's sequence counters are padded to ensure they appear in separate cache lines.
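The effect of padding on memory layout can be illustrated with a small sketch. It is not the Disruptor's code (which is Java); it uses Python's ctypes only to show where two counters land relative to cache line boundaries. The 64-byte line size and the struct and field names are assumptions for illustration, and the struct is assumed to start on a cache line boundary.

```python
import ctypes

CACHE_LINE_BYTES = 64  # assumption: a common cache line size on x86-64


class UnpaddedCounters(ctypes.Structure):
    # Two 8-byte counters laid out back to back.
    _fields_ = [("producer", ctypes.c_int64), ("consumer", ctypes.c_int64)]


class PaddedCounters(ctypes.Structure):
    # Padding fills the rest of the first cache line, pushing the
    # second counter onto the next line.
    _fields_ = [
        ("producer", ctypes.c_int64),
        ("_pad", ctypes.c_byte * (CACHE_LINE_BYTES - ctypes.sizeof(ctypes.c_int64))),
        ("consumer", ctypes.c_int64),
    ]


def cache_line_of(struct_type, field_name):
    """Index of the cache line holding the field, counted from the struct start."""
    return getattr(struct_type, field_name).offset // CACHE_LINE_BYTES


def shares_line(struct_type):
    """True when both counters fall in the same cache line."""
    return cache_line_of(struct_type, "producer") == cache_line_of(struct_type, "consumer")


print("unpadded counters share a line:", shares_line(UnpaddedCounters))
print("padded counters share a line:", shares_line(PaddedCounters))
```

When two counters written by different cores share a line, each write invalidates the other core's cached copy; padding trades a little memory to avoid that.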

Acknowledgements.

Financial institutions are usually secretive with their technical work, usually with little reason. This is a problem as it hampers the ability for the profession to learn from experience. So I'm especially thankful for LMAX's openness in discussing their experiences - both with this article and in their other material.

The main creators of the Disruptor are Martin Thompson, Mike Barker, and Dave Farley.

Martin Thompson and Dave Farley gave me a detailed walk-through of the LMAX architecture that served as the basis for this article. They also responded swiftly to questions to improve my early drafts.

Concurrent programming is a tricky field that requires lots of attention to be competent at - and I have not put that effort in. As a result I'm entirely dependent upon others for understanding on concurrency and am thankful for their patient advice.

Further Reading.

If you'd prefer a video description of the LMAX architecture from LMAX team members, your best bet is the QCon presentation given in San Francisco 2010 by Martin Thompson and Michael Barker.

The source code for the Disruptor is available as open source. There is also a good technical paper (pdf) that goes into more depth as well as a collection of blogs and articles on it.

Various members of the LMAX team have their own blogs: Martin Thompson, Michael Barker, and Trisha Gee.

Trading Floor Architecture.


Executive Overview.

Increased competition, higher market data volume, and new regulatory demands are some of the driving forces behind industry changes. Firms are trying to maintain their competitive edge by constantly changing their trading strategies and increasing the speed of trading.

A viable architecture has to include the latest technologies from both network and application domains. It has to be modular to provide a manageable path to evolve each component with minimal disruption to the overall system. Therefore the architecture proposed by this paper is based on a services framework. We examine services such as ultra-low latency messaging, latency monitoring, multicast, computing, storage, data and application virtualization, trading resiliency, trading mobility, and thin client.

The solution to the complex requirements of the next-generation trading platform must be built with a holistic mindset, crossing the boundaries of traditional silos like business and technology or applications and networking.

This document's main goal is to provide guidelines for building an ultra-low latency trading platform while optimizing the raw throughput and message rate for both market data and FIX trading orders.

To achieve this, we are proposing the following latency reduction technologies:

• High speed inter-connect—InfiniBand or 10 Gbps connectivity for the trading cluster.

• High-speed messaging bus.

• Application acceleration via RDMA without application re-code.

• Real-time latency monitoring and re-direction of trading traffic to the path with minimum latency.

Industry Trends and Challenges.

Next-generation trading architectures have to respond to increased demands for speed, volume, and efficiency. For example, the volume of options market data is expected to double after the introduction of options penny trading in 2007. There are also regulatory demands for best execution, which require handling price updates at rates that approach 1M msg/sec. for exchanges. They also require visibility into the freshness of the data and proof that the client got the best possible execution.

In the short term, speed of trading and innovation are key differentiators. An increasing number of trades are handled by algorithmic trading applications placed as close as possible to the trade execution venue. A challenge with these "black-box" trading engines is that they compound the volume increase by issuing orders only to cancel them and re-submit them. The cause of this behavior is lack of visibility into which venue offers best execution. The human trader is now a "financial engineer," a "quant" (quantitative analyst) with programming skills, who can adjust trading models on the fly. Firms develop new financial instruments like weather derivatives or cross-asset class trades and they need to deploy the new applications quickly and in a scalable fashion.

In the long term, competitive differentiation should come from analysis, not just knowledge. The star traders of tomorrow assume risk, achieve true client insight, and consistently beat the market (source IBM: www-935.ibm/services/us/imc/pdf/ge510-6270-trader. pdf).

Business resilience has been one main concern of trading firms since September 11, 2001. Solutions in this area range from redundant data centers situated in different geographies and connected to multiple trading venues to virtual trader solutions offering power traders most of the functionality of a trading floor in a remote location.

The financial services industry is one of the most demanding in terms of IT requirements. The industry is experiencing an architectural shift towards Services-Oriented Architecture (SOA), Web services, and virtualization of IT resources. SOA takes advantage of the increase in network speed to enable dynamic binding and virtualization of software components. This allows the creation of new applications without losing the investment in existing systems and infrastructure. The concept has the potential to revolutionize the way integration is done, enabling significant reductions in the complexity and cost of such integration (gigaspaces/download/MerrilLynchGigaSpacesWP. pdf).

Another trend is the consolidation of servers into data center server farms, while trader desks have only KVM extensions and ultra-thin clients (e.g., SunRay and HP blade solutions). High-speed Metro Area Networks enable market data to be multicast between different locations, enabling the virtualization of the trading floor.

High-Level Architecture.

Figure 1 depicts the high-level architecture of a trading environment. The ticker plant and the algorithmic trading engines are located in the high performance trading cluster in the firm's data center or at the exchange. The human traders are located in the end-user applications area.

Functionally there are two application components in the enterprise trading environment, publishers and subscribers. The messaging bus provides the communication path between publishers and subscribers.

There are two types of traffic specific to a trading environment:

• Market Data—Carries pricing information for financial instruments, news, and other value-added information such as analytics. It is unidirectional and very latency sensitive, typically delivered over UDP multicast. It is measured in updates/sec. and in Mbps. Market data flows from one or multiple external feeds, coming from market data providers like stock exchanges, data aggregators, and ECNs. Each provider has their own market data format. The data is received by feed handlers, specialized applications which normalize and clean the data and then send it to data consumers, such as pricing engines, algorithmic trading applications, or human traders. Sell-side firms also send the market data to their clients, buy-side firms such as mutual funds, hedge funds, and other asset managers. Some buy-side firms may opt to receive direct feeds from exchanges, reducing latency.

Figure 1 Trading Architecture for a Buy Side/Sell Side Firm.

There is no industry standard for market data formats. Each exchange has their proprietary format. Financial content providers such as Reuters and Bloomberg aggregate different sources of market data, normalize it, and add news or analytics. Examples of consolidated feeds are RDF (Reuters Data Feed), RWF (Reuters Wire Format), and Bloomberg Professional Services Data.

To deliver lower latency market data, both vendors have released real-time market data feeds which are less processed and have less analytics:

– Bloomberg B-Pipe—With B-Pipe, Bloomberg de-couples their market data feed from their distribution platform because a Bloomberg terminal is not required to get B-Pipe. Wombat and Reuters Feed Handlers have announced support for B-Pipe.

A firm may decide to receive feeds directly from an exchange to reduce latency. The gains in transmission speed can be between 150 and 500 milliseconds. These feeds are more complex and more expensive, and the firm has to build and maintain its own ticker plant (financetech/featured/showArticle. jhtml? articleID=60404306).

• Trading Orders—This type of traffic carries the actual trades. It is bi-directional and very latency sensitive. It is measured in messages/sec. and Mbps. The orders originate from a buy side or sell side firm and are sent to trading venues like an Exchange or ECN for execution. The most common format for order transport is FIX (Financial Information eXchange—fixprotocol.org/). The applications which handle FIX messages are called FIX engines and they interface with order management systems (OMS).

An optimization to FIX is called FAST (FIX Adapted for STreaming), which uses a compression schema to reduce message length and, in effect, reduce latency. FAST is targeted more to the delivery of market data and has the potential to become a standard. FAST can also be used as a compression schema for proprietary market data formats.
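One building block of FAST's compression is stop-bit encoding: integers are written seven bits per byte, and the high bit of the final byte marks the end of the field, so small values need fewer bytes than a fixed-width or text representation. The sketch below covers only unsigned integers and leaves out the rest of FAST (templates, presence maps, field operators); the function names are made up for illustration.

```python
def encode_stop_bit(value: int) -> bytes:
    """Encode a non-negative integer with 7 data bits per byte.

    The most significant bit of the last byte is set to mark the end of the field.
    """
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()        # most significant 7-bit group first
    groups[-1] |= 0x80      # stop bit on the final byte
    return bytes(groups)


def decode_stop_bit(data: bytes) -> int:
    """Decode one stop-bit encoded unsigned integer from the start of `data`."""
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:     # stop bit reached
            break
    return value


encoded = encode_stop_bit(942755)
print(encoded.hex(), "->", decode_stop_bit(encoded))
```

Here the value 942755 takes three bytes, against six bytes as ASCII digits in a tag=value FIX field.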

To reduce latency, firms may opt to establish Direct Market Access (DMA).

DMA is the automated process of routing a securities order directly to an execution venue, therefore avoiding the intervention by a third-party (towergroup/research/content/glossary. jsp? page=1&glossaryId=383). DMA requires a direct connection to the execution venue.

The messaging bus is middleware software from vendors such as Tibco, 29West, Reuters RMDS, or an open source platform such as AMQP. The messaging bus uses a reliable mechanism to deliver messages. The transport can be done over TCP/IP (TibcoEMS, 29West, RMDS, and AMQP) or UDP/multicast (TibcoRV, 29West, and RMDS). One important concept in message distribution is the "topic stream," which is a subset of market data defined by criteria such as ticker symbol, industry, or a certain basket of financial instruments. Subscribers join topic groups mapped to one or multiple sub-topics in order to receive only the relevant information. In the past, all traders received all market data. At the current volumes of traffic, this would be sub-optimal.

The network plays a critical role in the trading environment. Market data is carried to the trading floor where the human traders are located via a Campus or Metro Area high-speed network. High availability and low latency, as well as high throughput, are the most important metrics.

The high performance trading environment has most of its components in the Data Center server farm. To minimize latency, the algorithmic trading engines need to be located in the proximity of the feed handlers, FIX engines, and order management systems. An alternate deployment model has the algorithmic trading systems located at an exchange or a service provider with fast connectivity to multiple exchanges.

Deployment Models.

There are two deployment models for a high performance trading platform. Firms may choose to have a mix of the two:

• Data Center of the trading firm (Figure 2)—This is the traditional model, where a full-fledged trading platform is developed and maintained by the firm with communication links to all the trading venues. Latency varies with the speed of the links and the number of hops between the firm and the venues.

Figure 2 Traditional Deployment Model.

• Co-location at the trading venue (exchanges, financial service providers (FSP)) (Figure 3)

The trading firm deploys its automated trading platform as close as possible to the execution venues to minimize latency.

Figure 3 Hosted Deployment Model.

Services-Oriented Trading Architecture.

We are proposing a services-oriented framework for building the next-generation trading architecture. This approach provides a conceptual framework and an implementation path based on modularization and minimization of inter-dependencies.

This framework provides firms with a methodology to:

• Evaluate their current state in terms of services.

• Prioritize services based on their value to the business.

• Evolve the trading platform to the desired state using a modular approach.

The high performance trading architecture relies on the following services, as defined by the services architecture framework represented in Figure 4.

Figure 4 Service Architecture Framework for High Performance Trading.

Table 1 Service Descriptions and Technologies.

Service | Description and technologies
Ultra-low latency messaging | Messaging bus middleware.
Latency monitoring | Instrumentation—appliances, software agents, and router modules.
Computing services | OS and I/O virtualization, Remote Direct Memory Access (RDMA), TCP Offload Engines (TOE).
Application virtualization | Middleware which parallelizes application processing.
Data virtualization | Middleware which speeds up data access for applications, e.g., in-memory caching.
Multicast service | Hardware-assisted multicast replication throughout the network; multicast Layer 2 and Layer 3 optimizations.
Storage services | Virtualization of storage hardware (VSANs), data replication, remote backup, and file virtualization.
Trading resilience and mobility | Local and site load balancing and high availability campus networks.
Wide Area application services | Acceleration of applications over a WAN connection for traders residing off-campus.
Thin client service | De-coupling of the computing resources from the end-user facing terminals.

Ultra-Low Latency Messaging Service.

This service is provided by the messaging bus, which is a software system that solves the problem of connecting many-to-many applications. The system consists of:

• A set of pre-defined message schemas.

• A set of common command messages.

• A shared application infrastructure for sending the messages to recipients. The shared infrastructure can be based on a message broker or on a publish/subscribe model.

The key requirements for the next-generation messaging bus are (source 29West):

• Lowest possible latency (e.g., less than 100 microseconds)

• Stability under heavy load (e.g., more than 1.4 million msg/sec.)

• Control and flexibility (rate control and configurable transports)

There are efforts in the industry to standardize the messaging bus. Advanced Message Queuing Protocol (AMQP) is an example of an open standard championed by J. P. Morgan Chase and supported by a group of vendors such as Cisco, Envoy Technologies, Red Hat, TWIST Process Innovations, Iona, 29West, and iMatix. Two of the main goals are to provide a simpler path to interoperability for applications written on different platforms, and modularity so that the middleware can be easily evolved.

In very general terms, an AMQP server is analogous to an E-mail server with each exchange acting as a message transfer agent and each message queue as a mailbox. The bindings define the routing tables in each transfer agent. Publishers send messages to individual transfer agents, which then route the messages into mailboxes. Consumers take messages from mailboxes, which creates a powerful and flexible model that is simple (source: amqp. org/tikiwiki/tiki-index. php? page=OpenApproach#Why_AMQP_).
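The exchange, binding, and queue roles described above can be sketched as a toy in-process model. This is not an AMQP client or broker: the Exchange class, its exact-match routing, and the routing keys are assumptions made for illustration only.

```python
from collections import deque


class Exchange:
    """Toy exchange: routes a message to every queue bound to its routing key."""

    def __init__(self):
        self.bindings = {}  # routing key -> list of queues (the "routing table")

    def bind(self, queue, routing_key):
        self.bindings.setdefault(routing_key, []).append(queue)

    def publish(self, routing_key, body):
        # Messages with no matching binding are simply dropped in this sketch.
        for queue in self.bindings.get(routing_key, []):
            queue.append(body)


# Queues play the role of mailboxes.
orders_queue = deque()
audit_queue = deque()

exchange = Exchange()
exchange.bind(orders_queue, "order.new")
exchange.bind(audit_queue, "order.new")
exchange.bind(audit_queue, "order.cancel")

# Publishers talk only to the exchange, never to a queue directly.
exchange.publish("order.new", "buy 100 XYZ")
exchange.publish("order.cancel", "cancel 42")

# Consumers take messages from their own mailbox.
first_order = orders_queue.popleft()
print(first_order, list(audit_queue))
```

Because publishers and consumers share only the exchange and the binding table, either side can be changed or added without touching the other.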

Latency Monitoring Service.

The main requirements for this service are:

• Sub-millisecond granularity of measurements.

• Near-real time visibility without adding latency to the trading traffic.

• Ability to differentiate application processing latency from network transit latency.

• Ability to handle high message rates.

• Provide a programmatic interface for trading applications to receive latency data, thus enabling algorithmic trading engines to adapt to changing conditions.

• Correlate network events with application events for troubleshooting purposes.

Latency can be defined as the time interval between when a trade order is sent and when the same order is acknowledged and acted upon by the receiving party.
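Under this definition, separating application processing latency from network transit latency comes down to where timestamps are taken. The sketch below assumes four hypothetical capture points on synchronized clocks; the field names and sample values are illustrative and do not come from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class OrderTimestamps:
    """Hypothetical capture points, in microseconds on synchronized clocks."""
    sent_by_client: int     # order leaves the sender
    arrived_at_venue: int   # order seen on the wire at the receiving side
    ack_left_venue: int     # acknowledgement seen on the wire leaving the receiver
    ack_received: int       # acknowledgement arrives back at the sender


def breakdown(ts: OrderTimestamps) -> dict:
    """Split end-to-end latency into application and network components."""
    total = ts.ack_received - ts.sent_by_client
    application = ts.ack_left_venue - ts.arrived_at_venue
    network = total - application  # both transit directions combined
    return {"total_us": total, "application_us": application, "network_us": network}


sample = OrderTimestamps(sent_by_client=0, arrived_at_venue=180,
                         ack_left_venue=630, ack_received=800)
result = breakdown(sample)
print(result)
```

In this made-up sample more than half of the interval is spent inside the receiving application, which a network-only measurement would not show.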

Addressing the latency issue is a complex problem, requiring a holistic approach that identifies all sources of latency and applies different technologies at different layers of the system.

Figure 5 depicts the variety of components that can introduce latency at each layer of the OSI stack. It also maps each source of latency with a possible solution and a monitoring solution. This layered approach can give firms a more structured way of attacking the latency issue, whereby each component can be thought of as a service and treated consistently across the firm.

Maintaining an accurate measure of the dynamic state of this time interval across alternative routes and destinations can be of great assistance in tactical trading decisions. The ability to identify the exact location of delays, whether in the customer's edge network, the central processing hub, or the transaction application level, significantly determines the ability of service providers to meet their trading service-level agreements (SLAs). For buy-side and sell-side firms, as well as for market-data syndicators, the quick identification and removal of bottlenecks translates directly into enhanced trade opportunities and revenue.

Figure 5 Latency Management Architecture.

Cisco Low-Latency Monitoring Tools.

Traditional network monitoring tools operate with minutes or seconds granularity. Next-generation trading platforms, especially those supporting algorithmic trading, require latencies less than 5 ms and extremely low levels of packet loss. On a Gigabit LAN, a 100 ms microburst can cause 10,000 transactions to be lost or excessively delayed.
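The 10,000 figure is consistent with a link held at line rate for the whole burst if each transaction occupies about 1,250 bytes on the wire. That message size is an assumption made here for illustration; the text does not give one.

```python
LINK_BITS_PER_SECOND = 1_000_000_000   # Gigabit LAN
BURST_MILLISECONDS = 100               # 100 ms microburst
ASSUMED_MESSAGE_BYTES = 1_250          # assumption: not stated in the text


def messages_in_burst(link_bps: int, burst_ms: int, message_bytes: int) -> int:
    """Number of whole messages that fit on the wire during a burst at line rate."""
    burst_bits = link_bps * burst_ms // 1000
    burst_bytes = burst_bits // 8
    return burst_bytes // message_bytes


affected = messages_in_burst(LINK_BITS_PER_SECOND, BURST_MILLISECONDS, ASSUMED_MESSAGE_BYTES)
print(affected, "messages could be queued behind or dropped by the burst")
```

Smaller messages would raise the count proportionally, since the burst occupies a fixed 12.5 MB of link capacity.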

Cisco offers its customers a choice of tools to measure latency in a trading environment:

• Bandwidth Quality Manager (BQM) (OEM from Corvil)

• Cisco AON-based Financial Services Latency Monitoring Solution (FSMS)

Bandwidth Quality Manager.

Bandwidth Quality Manager (BQM) 4.0 is a next-generation network application performance management product that enables customers to monitor and provision their network for controlled levels of latency and loss performance. While BQM is not exclusively targeted at trading networks, its microsecond visibility combined with intelligent bandwidth provisioning features make it ideal for these demanding environments.

Cisco BQM 4.0 implements a broad set of patented and patent-pending traffic measurement and network analysis technologies that give the user unprecedented visibility and understanding of how to optimize the network for maximum application performance.

Cisco BQM is now supported on the product family of Cisco Application Deployment Engine (ADE). The Cisco ADE product family is the platform of choice for Cisco network management applications.

BQM Benefits.

Cisco BQM micro-visibility is the ability to detect, measure, and analyze latency, jitter, and loss inducing traffic events down to microsecond levels of granularity with per packet resolution. This enables Cisco BQM to detect and determine the impact of traffic events on network latency, jitter, and loss. Critical for trading environments is that BQM can support latency, loss, and jitter measurements one-way for both TCP and UDP (multicast) traffic. This means it reports seamlessly for both trading traffic and market data feeds.

BQM allows the user to specify a comprehensive set of thresholds (against microburst activity, latency, loss, jitter, utilization, etc.) on all interfaces. BQM then operates a background rolling packet capture. Whenever a threshold violation or other potential performance degradation event occurs, it triggers Cisco BQM to store the packet capture to disk for later analysis. This allows the user to examine in full detail both the application traffic that was affected by performance degradation ("the victims") and the traffic that caused the performance degradation ("the culprits"). This can significantly reduce the time spent diagnosing and resolving network performance issues.
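The rolling-capture idea can be sketched with a bounded buffer that is copied out whenever a threshold is crossed. This is an illustration of the general technique, not BQM's implementation; the class name, the latency-only threshold, and the in-memory list standing in for disk are all assumptions.

```python
from collections import deque


class RollingCapture:
    """Keep the most recent packets; snapshot them when latency crosses a threshold."""

    def __init__(self, capacity: int, latency_threshold_us: int):
        self.buffer = deque(maxlen=capacity)   # oldest packets fall off automatically
        self.latency_threshold_us = latency_threshold_us
        self.saved_captures = []               # stands in for "store to disk"

    def observe(self, packet_id: int, latency_us: int) -> None:
        self.buffer.append((packet_id, latency_us))
        if latency_us > self.latency_threshold_us:
            # Save the violating packet together with the traffic just before it.
            self.saved_captures.append(list(self.buffer))


capture = RollingCapture(capacity=3, latency_threshold_us=500)
for packet_id, latency_us in enumerate([120, 130, 110, 125, 900, 140]):
    capture.observe(packet_id, latency_us)

print(capture.saved_captures)
```

Keeping the packets that preceded the violation is what lets an analyst look for the traffic that caused the delay as well as the traffic that suffered from it.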

BQM is also able to provide detailed bandwidth and quality of service (QoS) policy provisioning recommendations, which the user can directly apply to achieve desired network performance.

BQM Measurements Illustrated.

To understand the difference between some of the more conventional measurement techniques and the visibility provided by BQM, we can look at some comparison graphs. In the first set of graphs (Figure 6 and Figure 7), we see the difference between the latency measured by BQM's Passive Network Quality Monitor (PNQM) and the latency measured by injecting ping packets every 1 second into the traffic stream.

In Figure 6, we see the latency reported by 1-second ICMP ping packets for real network traffic (it is divided by 2 to give an estimate for the one-way delay). It shows the delay comfortably below about 5 ms for almost all of the time.

Figure 6 Latency Reported by 1-Second ICMP Ping Packets for Real Network Traffic.

In Figure 7, we see the latency reported by PNQM for the same traffic at the same time. Here we see that by measuring the one-way latency of the actual application packets, we get a radically different picture. Here the latency is seen to be hovering around 20 ms, with occasional bursts far higher. The explanation is that because ping is sending packets only every second, it is completely missing most of the application traffic latency. In fact, ping results typically only indicate round trip propagation delay rather than realistic application latency across the network.

Figure 7 Latency Reported by PNQM for Real Network Traffic.
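The gap between the two figures is a sampling effect, which a deliberately contrived trace can reproduce. In the sketch below the latency function is hypothetical: it assumes the path is congested except for a short quiet window at the start of each second, which is exactly when the once-per-second probe is sent.

```python
def latency_at(ms: int) -> float:
    """Hypothetical one-way latency (in ms) of a packet sent at time `ms`.

    Assumption for illustration: the path is congested except for a quiet
    10 ms window at the start of each second.
    """
    return 2.0 if ms % 1000 < 10 else 20.0


DURATION_MS = 10_000

# Passive measurement: one reading per application packet (one packet per ms here).
per_packet = [latency_at(t) for t in range(DURATION_MS)]

# Active measurement: one probe per second.
ping_samples = [latency_at(t) for t in range(0, DURATION_MS, 1000)]

per_packet_mean = sum(per_packet) / len(per_packet)
ping_mean = sum(ping_samples) / len(ping_samples)
print(f"ping mean: {ping_mean} ms, per-packet mean: {per_packet_mean} ms")
```

Ten probes cannot characterize ten thousand packets; real traffic is less regular than this trace, but the probe still sees only the instants at which it happens to be sent.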

In the second example (Figure 8), we see the difference in reported link load or saturation levels between a 5-minute average view and a 5 ms microburst view (BQM can report on microbursts down to about 10-100 nanosecond accuracy). The green line shows the average utilization at 5-minute averages to be low, maybe up to 5 Mbits/s. The dark blue plot shows the 5 ms microburst activity reaching between 75 Mbits/s and 100 Mbits/s, effectively the LAN speed. BQM shows this level of granularity for all applications and it also gives clear provisioning rules to enable the user to control or neutralize these microbursts.

Figure 8 Difference in Reported Link Load Between a 5-Minute Average View and a 5 ms Microburst View.
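The same byte counts give very different utilization figures depending on the averaging window. The sketch below uses a hypothetical one-second trace (rather than five minutes, to keep it small) that is idle except for a single 5 ms burst at 100 Mbit/s; the trace and function name are made up for illustration.

```python
def utilization_mbps(bytes_per_ms, window_ms):
    """Utilization in Mbit/s for each consecutive window of `window_ms` milliseconds."""
    rates = []
    for start in range(0, len(bytes_per_ms), window_ms):
        window = bytes_per_ms[start:start + window_ms]
        bits = sum(window) * 8
        bits_per_second = bits * 1000 / len(window)
        rates.append(bits_per_second / 1_000_000)
    return rates


# Hypothetical trace: idle for one second except a 5 ms burst at line rate.
# 100 Mbit/s corresponds to 12,500 bytes per millisecond.
trace = [0] * 1000
trace[500:505] = [12_500] * 5

average = utilization_mbps(trace, 1000)[0]
peak_5ms = max(utilization_mbps(trace, 5))
print(f"1 s average: {average} Mbit/s, 5 ms peak: {peak_5ms} Mbit/s")
```

The link looks almost idle on average while being fully saturated for the duration of the burst, which is when queuing delay and loss occur.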

BQM Deployment in the Trading Network.

Figure 9 shows a typical BQM deployment in a trading network.

Figure 9 Typical BQM Deployment in a Trading Network.

BQM can then be used to answer these types of questions:

• Are any of my Gigabit LAN core links saturated for more than X milliseconds? Is this causing loss? Which links would most benefit from an upgrade to Etherchannel or 10 Gigabit speeds?

• What application traffic is causing the saturation of my 1 Gigabit links?

• Is any of the market data experiencing end-to-end loss?

• How much additional latency does the failover data center experience? Is this link sized correctly to deal with microbursts?

• Are my traders getting low latency updates from the market data distribution layer? Are they seeing any delays greater than X milliseconds?

Being able to answer these questions simply and effectively saves time and money in running the trading network.

BQM is an essential tool for gaining visibility in market data and trading environments. It provides granular end-to-end latency measurements in complex infrastructures that experience high-volume data movement. Effectively detecting microbursts in sub-millisecond levels and receiving expert analysis on a particular event is invaluable to trading floor architects. Smart bandwidth provisioning recommendations, such as sizing and what-if analysis, provide greater agility to respond to volatile market conditions. As the explosion of algorithmic trading and increasing message rates continues, BQM, combined with its QoS tool, provides the capability of implementing QoS policies that can protect critical trading applications.

Cisco Financial Services Latency Monitoring Solution.

Cisco and Trading Metrics have collaborated on latency monitoring solutions for FIX order flow and market data monitoring. Cisco AON technology is the foundation for a new class of network-embedded products and solutions that help merge intelligent networks with application infrastructure, based on either service-oriented or traditional architectures. Trading Metrics is a leading provider of analytics software for network infrastructure and application latency monitoring purposes (tradingmetrics/).

The Cisco AON Financial Services Latency Monitoring Solution (FSMS) correlates two kinds of events at the point of observation:

• Network events correlated directly with coincident application message handling.

• Trade order flow and matching market update events.

Using time stamps asserted at the point of capture in the network, real-time analysis of these correlated data streams permits precise identification of bottlenecks across the infrastructure while a trade is being executed or market data is being distributed. By monitoring and measuring latency early in the cycle, financial companies can make better decisions about which network service—and which intermediary, market, or counterparty—to select for routing trade orders. Likewise, this knowledge allows more streamlined access to updated market data (stock quotes, economic news, etc.), which is an important basis for initiating, withdrawing from, or pursuing market opportunities.

The components of the solution are:

• AON hardware in three form factors:

– AON Network Module for Cisco 2600/2800/3700/3800 routers.

– AON Blade for the Cisco Catalyst 6500 series.

– AON 8340 Appliance.

• Trading Metrics M&A 2.0 software, which provides the monitoring and alerting application, displays latency graphs on a dashboard, and issues alerts when slowdowns occur (tradingmetrics/TM_brochure. pdf).

Figure 10 AON-Based FIX Latency Monitoring.

Cisco IP SLA.

Cisco IP SLA is an embedded network management tool in Cisco IOS which allows routers and switches to generate synthetic traffic streams which can be measured for latency, jitter, packet loss, and other criteria (cisco/go/ipsla).

Two key concepts are the source of the generated traffic and the target. Both of these run an IP SLA "responder," which has the responsibility to timestamp the control traffic before it is sourced and returned by the target (for a round trip measurement). Various traffic types can be sourced within IP SLA and they are aimed at different metrics and target different services and applications. The UDP jitter operation is used to measure one-way and round-trip delay and report variations. As the traffic is time stamped on both sending and target devices using the responder capability, the round trip delay is characterized as the delta between the two timestamps.
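The delay and variation figures described here can be derived from per-packet timestamps. The sketch below is a simplified illustration, not Cisco's algorithm: it assumes sender and responder clocks are synchronized (needed for one-way figures), and the timestamps and function names are made up.

```python
def one_way_delays(send_times_us, receive_times_us):
    """Delay of each packet: receive timestamp minus send timestamp."""
    return [rx - tx for tx, rx in zip(send_times_us, receive_times_us)]


def jitter(delays_us):
    """Delay variation between consecutive packets (positive means delay grew)."""
    return [later - earlier for earlier, later in zip(delays_us, delays_us[1:])]


# Hypothetical probe stream: one packet every 20 ms, timestamps in microseconds.
send_us = [0, 20_000, 40_000, 60_000]
recv_us = [350, 20_420, 40_380, 60_900]

delays = one_way_delays(send_us, recv_us)
variation = jitter(delays)
print("delays (us):", delays)
print("jitter (us):", variation)
```

The last packet's 900-microsecond delay stands out only because the timestamps are kept at microsecond resolution; rounded to whole milliseconds, the first three packets would all report zero.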

A new feature was introduced in IOS 12.3(14)T, IP SLA Sub Millisecond Reporting, which allows for timestamps to be displayed with a resolution in microseconds, thus providing a level of granularity not previously available. This new feature has now made IP SLA relevant to campus networks where network latency is typically in the range of 300-800 microseconds and the ability to detect trends and spikes (brief trends) based on microsecond granularity counters is a requirement for customers engaged in time-sensitive electronic trading environments.

As a result, IP SLA is now being considered by significant numbers of financial organizations as they are all faced with requirements to:

• Report baseline latency to their users.

• Trend baseline latency over time.

• Respond quickly to traffic bursts that cause changes in the reported latency.

Sub-millisecond reporting is necessary for these customers, since many campus networks and backbones currently deliver under a millisecond of latency across several switch hops. Electronic trading environments have generally worked to eliminate or minimize all areas of device and network latency to deliver rapid order fulfillment to the business. Reporting that network response times are "just under one millisecond" is no longer sufficient; the granularity of latency measurements reported across a network segment or backbone needs to be closer to 300-800 microseconds, with a resolution of 100 microseconds.

IP SLA recently added support for IP multicast test streams, which can measure market data latency.

A typical network topology is shown in Figure 11 with the IP SLA shadow routers, sources, and responders.

Figure 11 IP SLA Deployment.

Computing Services.

Computing services cover a wide range of technologies with the goal of eliminating memory and CPU bottlenecks created by the processing of network packets. Trading applications consume high volumes of market data and the servers have to dedicate resources to processing network traffic instead of application processing.

• Transport processing—At high speeds, network packet processing can consume a significant amount of server CPU cycles and memory. An established rule of thumb states that 1Gbps of network bandwidth requires 1 GHz of processor capacity (source Intel white paper on I/O acceleration intel/technology/ioacceleration/306517.pdf).

• Intermediate buffer copying—In a conventional network stack implementation, data needs to be copied by the CPU between network buffers and application buffers. This overhead is worsened by the fact that memory speeds have not kept up with increases in CPU speeds. For example, processors like the Intel Xeon are approaching 4 GHz, while RAM chips hover around 400MHz (for DDR 3200 memory) (source Intel intel/technology/ioacceleration/306517.pdf).

• Context switching—Every time an individual packet needs to be processed, the CPU performs a context switch from application context to network traffic context. This overhead could be reduced if the switch would occur only when the whole application buffer is complete.

Figure 12 Sources of Overhead in Data Center Servers.

• TCP Offload Engine (TOE)—Offloads transport processor cycles to the NIC. Moves TCP/IP protocol stack buffer copies from system memory to NIC memory.

• Remote Direct Memory Access (RDMA)—Enables a network adapter to transfer data directly from application to application without involving the operating system. Eliminates intermediate and application buffer copies (memory bandwidth consumption).

• Kernel bypass — Direct user-level access to hardware. Dramatically reduces application context switches.

Figure 13 RDMA and Kernel Bypass.

InfiniBand is a point-to-point (switched fabric) bidirectional serial communication link which implements RDMA, among other features. Cisco offers an InfiniBand switch, the Server Fabric Switch (SFS): cisco/application/pdf/en/us/guest/netsol/ns500/c643/cdccont_0900aecd804c35cb.pdf.

Figure 14 Typical SFS Deployment.

Trading applications benefit from the reduction in latency and latency variability, as shown by a test performed with the Cisco SFS and Wombat Feed Handlers by Stac Research.

Application Virtualization Service.

Decoupling the application from the underlying OS and server hardware enables applications to run as network services. One application can be run in parallel on multiple servers, or multiple applications can be run on the same server, as the best resource allocation dictates. This decoupling enables better load balancing and disaster recovery for business continuance strategies. The process of re-allocating computing resources to an application is dynamic. Using an application virtualization system like Data Synapse's GridServer, applications can migrate, using pre-configured policies, to under-utilized servers in a supply-matches-demand process (wwwworkworld/supp/2005/ndc1/022105virtual.html?page=2).

There are many business advantages for financial firms who adopt application virtualization:

• Faster time to market for new products and services.

• Faster integration of firms following merger and acquisition activity.

• Increased application availability.

• Better workload distribution, which creates more "head room" for processing spikes in trading volume.

• Operational efficiency and control.

• Reduction in IT complexity.

Currently, application virtualization is not used in the trading front office. One use case is risk modeling, such as Monte Carlo simulations. As the technology evolves, it is conceivable that some of the trading platforms will adopt it.

Data Virtualization Service.

To effectively share resources across distributed enterprise applications, firms must be able to leverage data across multiple sources in real-time while ensuring data integrity. With solutions from data virtualization software vendors such as Gemstone or Tangosol (now Oracle), financial firms can access heterogeneous sources of data as a single system image that enables connectivity between business processes and unrestrained application access to distributed caching. The net result is that all users have instant access to these data resources across a distributed network (gridtoday/03/0210/101061.html).

This is called a data grid and is the first step in the process of creating what Gartner calls Extreme Transaction Processing (XTP) (gartner/DisplayDocument?ref=g_search&id=500947). Technologies such as data and application virtualization enable financial firms to perform real-time complex analytics, event-driven applications, and dynamic resource allocation.

One example of data virtualization in action is a global order book application. An order book is the repository of active orders that is published by the exchange or other market makers. A global order book aggregates orders from around the world from markets that operate independently. The biggest challenge for the application is scalability over WAN connectivity because it has to maintain state. Today's data grids are localized in data centers connected by Metro Area Networks (MAN). This is mainly because the applications themselves have limits—they have been developed without the WAN in mind.

Figure 15 GemStone GemFire Distributed Caching.

Before data virtualization, applications used database clustering for failover and scalability. This solution is limited by the performance of the underlying database. Failover is slower because the data is committed to disk. With data grids, the data which is part of the active state is cached in memory, which drastically reduces the failover time. Scaling the data grid means just adding more distributed resources, providing more deterministic performance compared to a database cluster.

Multicast Service.

Market data delivery is a perfect example of an application that needs to deliver the same data stream to hundreds and potentially thousands of end users. Market data services have been implemented with TCP or UDP broadcast as the network layer, but those implementations have limited scalability. Using TCP requires a separate socket and sliding window on the server for each recipient. UDP broadcast requires a separate copy of the stream for each destination subnet. Both of these methods exhaust the resources of the servers and the network. The server side must transmit and service each of the streams individually, which requires larger and larger server farms. On the network side, the required bandwidth for the application increases in a linear fashion. For example, to send a 1 Mbps stream to 1000 recipients using TCP requires 1 Gbps of bandwidth.

IP multicast is the only way to scale market data delivery. To deliver a 1 Mbps stream to 1000 recipients, IP multicast would require 1 Mbps. The stream can be delivered by as few as two servers—one primary and one backup for redundancy.
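The scaling argument can be checked with a few lines of arithmetic. This is a sketch only; the function name is hypothetical, and it counts only the bandwidth leaving the source, ignoring protocol overhead and retransmissions.

```python
def source_bandwidth_mbps(stream_mbps, recipients, transport):
    """Bandwidth the source must transmit to reach all recipients."""
    if transport == "tcp":
        # One unicast copy of the stream per recipient.
        return stream_mbps * recipients
    if transport == "multicast":
        # One copy total; the network replicates it where paths diverge.
        return stream_mbps
    raise ValueError(f"unknown transport: {transport}")


tcp_mbps = source_bandwidth_mbps(1, 1000, "tcp")
mcast_mbps = source_bandwidth_mbps(1, 1000, "multicast")
print(f"TCP unicast: {tcp_mbps} Mbps, IP multicast: {mcast_mbps} Mbps")
```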

There are two main phases of market data delivery to the end user. In the first phase, the data stream must be brought from the exchange into the brokerage's network. Typically the feeds are terminated in a data center on the customer premise. The feeds are then processed by a feed handler, which may normalize the data stream into a common format and then republish into the application messaging servers in the data center.

The second phase involves injecting the data stream into the application messaging bus which feeds the core infrastructure of the trading applications. The large brokerage houses have thousands of applications that use the market data streams for various purposes, such as live trades, long term trending, arbitrage, etc. Many of these applications listen to the feeds and then republish their own analytical and derivative information. For example, a brokerage may compare the prices of CSCO to the option prices of CSCO on another exchange and then publish ratings which a different application may monitor to determine how much they are out of synchronization.

Figure 16 Market Data Distribution Players.

The delivery of these data streams is typically over a reliable multicast transport protocol, traditionally Tibco Rendezvous. Tibco RV operates in a publish and subscribe environment. Each financial instrument is given a subject name, such as CSCO.last. Each application server can request the individual instruments of interest by their subject name and receive just that subset of the information. This is called subject-based forwarding or filtering. Subject-based filtering is patented by Tibco.
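The idea of subject-based filtering can be illustrated with a small in-process sketch. This is not the Tibco Rendezvous API; the `SubjectBus` class, the `CSCO.bid` subject, and the prices are hypothetical and only show how a subscriber receives the subjects it asked for and nothing else.

```python
from collections import defaultdict


class SubjectBus:
    """Toy publish/subscribe bus that delivers messages by exact subject name."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, subject, callback):
        """Register a callback for one subject name."""
        self._subscribers[subject].append(callback)

    def publish(self, subject, message):
        """Deliver the message only to callbacks registered for this subject."""
        for callback in self._subscribers.get(subject, []):
            callback(subject, message)


received = []
bus = SubjectBus()
bus.subscribe("CSCO.last", lambda subject, msg: received.append((subject, msg)))

bus.publish("CSCO.last", 27.15)  # delivered: the subscriber asked for it
bus.publish("CSCO.bid", 27.10)   # filtered out: nobody subscribed
print(received)
```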

A distinction should be made between the first and second phases of market data delivery. The delivery of market data from the exchange to the brokerage is mostly a one-to-many application. The only exception to the unidirectional nature of market data may be retransmission requests, which are usually sent using unicast. The trading applications, however, are definitely many-to-many applications and may interact with the exchanges to place orders.

Figure 17 Market Data Architecture.

Design Issues.

Number of Groups/Channels to Use.

Many application developers consider using thousands of multicast groups to give them the ability to divide up products or instruments into small buckets. Normally these applications send many small messages as part of their information bus. Usually several messages are packed into each packet, and each packet is received by many users. Sending fewer messages in each packet increases the overhead necessary for each message.

In the extreme case, sending only one message in each packet quickly reaches the point of diminishing returns—there is more overhead sent than actual data. Application developers must find a reasonable compromise between the number of groups and breaking up their products into logical buckets.
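A rough calculation shows how quickly header overhead dominates when messages are small. The sketch below assumes IPv4 and UDP headers only (20 + 8 bytes) and ignores link-layer framing and any messaging-layer headers; the 20-byte message size is a hypothetical example.

```python
IP_UDP_HEADER_BYTES = 20 + 8  # IPv4 header + UDP header, no options


def overhead_fraction(message_bytes, messages_per_packet):
    """Fraction of each packet that is header rather than message data."""
    payload = message_bytes * messages_per_packet
    return IP_UDP_HEADER_BYTES / (IP_UDP_HEADER_BYTES + payload)


# Hypothetical 20-byte messages, batched at different levels.
for batch in (1, 5, 10, 50):
    print(f"{batch:>2} message(s) per packet: "
          f"{overhead_fraction(20, batch):.1%} overhead")
```

With one 20-byte message per packet, more than half of every packet is header, which is the point of diminishing returns the text describes.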

Consider, for example, the Nasdaq Quotation Dissemination Service (NQDS). The instruments are broken up alphabetically:

Another example is the Nasdaq Totalview service, broken up this way:

This approach allows for straightforward network/application management, but does not necessarily allow for optimized bandwidth utilization for most users. A user of NQDS who is interested in technology stocks, and would like to subscribe to just CSCO and INTL, would have to pull down all the data for the first two groups of NQDS. Understanding the way users pull down the data and then organizing it into appropriate logical groups optimizes the bandwidth for each user.

In many market data applications, optimizing the data organization would be of limited value. Typically customers bring in all data into a few machines and filter the instruments. Using more groups is just more overhead for the stack and does not help the customers conserve bandwidth. Another approach might be to keep the groups down to a minimum level and use UDP port numbers to further differentiate if necessary. The other extreme would be to use just one multicast group for the entire application and then have the end user filter the data. In some situations this may be sufficient.

Intermittent Sources.

A common issue with market data applications is servers that send data to a multicast group and then go silent for more than 3.5 minutes. These intermittent sources may cause thrashing of state on the network and can introduce packet loss during the window of time when soft state and then hardware shortcuts are being created.

PIM-Bidir or PIM-SSM.

The first and best solution for intermittent sources is to use PIM-Bidir for many-to-many applications and PIM-SSM for one-to-many applications.

Both of these optimizations of the PIM protocol do not have any data-driven events in creating forwarding state. That means that as long as the receivers are subscribed to the streams, the network has the forwarding state created in the hardware switching path.

Intermittent sources are not an issue with PIM-Bidir and PIM-SSM.

Null Packets.

In PIM-SM environments a common method to make sure forwarding state is created is to send a burst of null packets to the multicast group before the actual data stream. The application must efficiently ignore these null data packets to ensure it does not affect performance. The sources must only send the burst of packets if they have been silent for more than 3 minutes. A good practice is to send the burst if the source is silent for more than a minute. Many financials send out an initial burst of traffic in the morning and then all well-behaved sources do not have problems.

Periodic Keepalives or Heartbeats.

An alternative approach for PIM-SM environments is for sources to send periodic heartbeat messages to the multicast groups. This is a similar approach to the null packets, but the packets can be sent on a regular timer so that the forwarding state never expires.
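The timing logic behind null packets and heartbeats can be sketched as a simple simulation. The one-minute interval follows the good practice mentioned above and the 3.5-minute limit comes from the intermittent-source discussion; the function name, the one-second time step, and the assumption that the source last sent at time 0 are all hypothetical.

```python
KEEPALIVE_INTERVAL_S = 60  # send a keepalive after one minute of silence
STATE_TIMEOUT_S = 210      # 3.5 minutes, the silence limit cited in the text


def heartbeat_times(data_times_s, end_s, interval_s=KEEPALIVE_INTERVAL_S):
    """Return the seconds at which a source would inject a heartbeat.

    data_times_s: seconds at which real market data is sent.
    Assumes the source last sent something at time 0.
    """
    data = set(data_times_s)
    last_sent = 0
    beats = []
    for now in range(1, end_s + 1):
        if now in data:
            last_sent = now            # real traffic refreshes the state
        elif now - last_sent >= interval_s:
            beats.append(now)          # silent too long: send a heartbeat
            last_sent = now
    return beats


# Hypothetical source: bursts at 10 s and 20 s, then silent until 400 s.
beats = heartbeat_times(data_times_s=[10, 20, 400], end_s=500)
print(beats)
```

Because something is sent at least once a minute, the gap between packets never approaches the 3.5-minute limit.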

(S, G) Expiry Timer.

Finally, Cisco has modified the operation of the (S, G) expiry timer in IOS. The timer is now configurable through a CLI knob that allows the state for an (S, G) to stay alive for hours without any traffic being sent. This approach should be considered a workaround until PIM-Bidir or PIM-SSM is deployed or the application is fixed.

RTCP Feedback.

A common issue with real time voice and video applications that use RTP is the use of RTCP feedback traffic. Unnecessary use of the feedback option can create excessive multicast state in the network. If the RTCP traffic is not required by the application it should be avoided.

Fast Producers and Slow Consumers.

Today many servers providing market data are attached at Gigabit speeds, while the receivers are attached at different speeds, usually 100Mbps. This creates the potential for receivers to drop packets and request re-transmissions, which creates more traffic that the slowest consumers cannot handle, continuing the vicious circle.

The solution needs to be some type of access control in the application that limits the amount of data that one host can request. QoS and other network functions can mitigate the problem, but ultimately the subscriptions need to be managed in the application.
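One way such an application-level control might look is a per-host cap on the total rate of subscribed streams. This is a minimal sketch under assumed names and numbers (the `SubscriptionLimiter` class, the 80 Mbps cap, and the stream rates are hypothetical), not a description of any specific messaging product.

```python
class SubscriptionLimiter:
    """Reject subscriptions that would push a host past its bandwidth cap."""

    def __init__(self, max_mbps_per_host):
        self.max_mbps_per_host = max_mbps_per_host
        self._granted = {}  # host -> {subject: stream rate in Mbps}

    def request(self, host, subject, stream_mbps):
        """Return True if the subscription is granted, False if refused."""
        current = self._granted.setdefault(host, {})
        if subject in current:
            return True  # already subscribed, nothing new to account for
        if sum(current.values()) + stream_mbps > self.max_mbps_per_host:
            return False  # would exceed what this host can consume
        current[subject] = stream_mbps
        return True


# Hypothetical receiver on a 100 Mbps port, capped at 80 Mbps of feeds.
limiter = SubscriptionLimiter(max_mbps_per_host=80)
results = [
    limiter.request("desk1", "equities", 50),
    limiter.request("desk1", "options", 40),  # 50 + 40 exceeds the cap
    limiter.request("desk1", "fx", 30),       # 50 + 30 fits
]
print(results)
```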

Tibco Heartbeats.

TibcoRV has had the ability to use IP multicast for the heartbeat between the TICs for many years. However, there are some brokerage houses that are still using very old versions of TibcoRV that use UDP broadcast support for the resiliency. This limitation is often cited as a reason to maintain a Layer 2 infrastructure between TICs located in different data centers. These older versions of TibcoRV should be phased out in favor of the IP multicast supported versions.

Multicast Forwarding Options.

PIM Sparse Mode.

The standard IP multicast forwarding protocol used today for market data delivery is PIM Sparse Mode. It is supported on all Cisco routers and switches and is well understood. PIM-SM can be used in all the network components from the exchange, FSP, and brokerage.

There are, however, some long-standing issues and unnecessary complexity associated with a PIM-SM deployment that could be avoided by using PIM-Bidir and PIM-SSM. These are covered in the next sections.

The main components of the PIM-SM implementation are:

• PIM Sparse Mode v2.

• Shared Tree (spt-threshold infinity)

A design option in the brokerage or in the exchange.

Details of Anycast RP can be found in:

The classic high availability design for Tibco in the brokerage network is documented in:

Bidirectional PIM.

PIM-Bidir is an optimization of PIM Sparse Mode for many-to-many applications. It has several key advantages over a PIM-SM deployment:

• Better support for intermittent sources.

• No data-triggered events.

One of the weaknesses of PIM-SM is that the network continually needs to react to active data flows. This can cause non-deterministic behavior that may be hard to troubleshoot. PIM-Bidir has the following major protocol differences over PIM-SM:

– No source registration.

Source traffic is automatically sent to the RP and then down to the interested receivers. There is no unicast encapsulation, no PIM join from the RP back to the first-hop router, and no register-stop message.

All PIM-Bidir traffic is forwarded on a *,G forwarding entry. The router does not have to monitor the traffic flow on a *,G and then send joins when the traffic passes a threshold.

– No need for an actual RP.

The RP does not have an actual protocol function in PIM-Bidir. The RP acts as a routing vector in which all the traffic converges. The RP can be configured as an address that is not assigned to any particular device. This is called a Phantom RP.

– No need for MSDP.

MSDP provides source information between RPs in a PIM-SM network. PIM-Bidir does not use the active source information for any forwarding decisions and therefore MSDP is not required.

Bidirectional PIM is ideally suited for the brokerage network in the data center of the exchange. In this environment there are many sources sending to a relatively small set of groups in a many-to-many traffic pattern.

The key components of the PIM-Bidir implementation are:

Further details about Phantom RP and basic PIM-Bidir design are documented in:

Source Specific Multicast.

PIM-SSM is an optimization of PIM Sparse Mode for one-to-many applications. In certain environments it can offer several distinct advantages over PIM-SM. Like PIM-Bidir, PIM-SSM does not rely on any data-triggered events. Furthermore, PIM-SSM does not require an RP at all—there is no such concept in PIM-SSM. The forwarding information in the network is completely controlled by the interest of the receivers.

Source Specific Multicast is ideally suited for market data delivery in the financial service provider. The FSP can receive the feeds from the exchanges and then route them to the end of their network.

Many FSPs are also implementing MPLS and Multicast VPNs in their core. PIM-SSM is the preferred method for transporting traffic in VRFs.

When PIM-SSM is deployed all the way to the end user, the receiver indicates its interest in a particular (S, G) with IGMPv3. Even though IGMPv3 was defined by RFC 3376 back in October 2002, it still has not been implemented by all edge devices. This creates a challenge for deploying an end-to-end PIM-SSM service. A transitional solution has been developed by Cisco to enable an edge device that supports IGMPv2 to participate in a PIM-SSM service. This feature is called SSM Mapping and is documented in:

Storage Services.

The service provides storage capabilities for the market data and trading environments. Trading applications access backend storage to connect to different databases and other repositories consisting of portfolios, trade settlements, compliance data, management applications, Enterprise Service Bus (ESB), and other critical applications where reliability and security are critical to the success of the business. The main requirements for the service are:

Storage virtualization is an enabling technology that simplifies management of complex infrastructures, enables non-disruptive operations, and facilitates critical elements of a proactive information lifecycle management (ILM) strategy. EMC Invista running on the Cisco MDS 9000 enables heterogeneous storage pooling and dynamic storage provisioning, allowing allocation of any storage to any application. High availability is increased with seamless data migration. Appropriate class of storage is allocated to point-in-time copies (clones). Storage virtualization is also leveraged through the use of Virtual Storage Area Networks (VSANs), which enable the consolidation of multiple isolated SANs onto a single physical SAN infrastructure, while still partitioning them as completely separate logical entities. VSANs provide all the security and fabric services of traditional SANs, yet give organizations the flexibility to easily move resources from one VSAN to another. This results in increased disk and network utilization while driving down the cost of management. Integrated Inter VSAN Routing (IVR) enables sharing of common resources across VSANs.

Figure 18 High Performance Computing Storage.

Replication of data to a secondary and tertiary data center is crucial for business continuance. Replication offsite over Fibre Channel over IP (FCIP) coupled with write acceleration and tape acceleration provides improved performance over long distance. Continuous Data Replication (CDP) is another mechanism which is gaining popularity in the industry. It refers to backup of computer data by automatically saving a copy of every change made to that data, essentially capturing every version of the data that the user saves. It allows the user or administrator to restore data to any point in time. Solutions from EMC and Incipient utilize the SANTap protocol on the Storage Services Module (SSM) in the MDS platform to provide CDP functionality. The SSM uses the SANTap service to intercept and redirect a copy of a write between a given initiator and target. The appliance does not reside in the data path; it is completely passive. The CDP solutions typically leverage a history journal that tracks all changes and bookmarks that identify application-specific events. This ensures that data at any point in time is fully self-consistent and is recoverable instantly in the event of a site failure.

Backup procedure reliability and performance are extremely important when storing critical financial data to a SAN. The use of expensive media servers to move data from disk to tape devices can be cumbersome. Network-accelerated serverless backup (NASB) helps you back up increased amounts of data in shorter backup time frames by shifting the data movement from multiple backup servers to Cisco MDS 9000 Series multilayer switches. This technology decreases impact on application servers because the MDS offloads the application and backup servers. It also reduces the number of backup and media servers required, thus reducing CAPEX and OPEX. The flexibility of the backup environment increases because storage and tape drives can reside anywhere on the SAN.

Trading Resilience and Mobility.

The main requirements for this service are to provide the virtual trader:

• Fully scalable and redundant campus trading environment.

• Resilient server load balancing and high availability in analytic server farms.

• Global site load balancing that provides the capability to continue participating in the market venues of closest proximity.

A highly-available campus environment is capable of sustaining multiple failures (i.e., links, switches, modules, etc.), which provides non-disruptive access to trading systems for traders and market data feeds. Fine-tuned routing protocol timers, in conjunction with mechanisms such as NSF/SSO, provide subsecond recovery from any failure.

The high-speed interconnect between data centers can be DWDM/dark fiber, which provides business continuance in case of a site failure. Each site is 100km-200km apart, allowing synchronous data replication. Usually the distance for synchronous data replication is 100km, but with Read/Write Acceleration it can stretch to 200km. A tertiary data center can be greater than 200km away, which would replicate data in an asynchronous fashion.
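The distance limits for synchronous replication follow from propagation delay, since every synchronous write must wait for the remote site to respond. The sketch below assumes light travels through fiber at roughly 5 microseconds per kilometer; it ignores equipment, serialization, and protocol processing delays, so real figures will be higher.

```python
FIBER_US_PER_KM = 5.0  # assumption: about 2e8 m/s propagation speed in glass


def round_trip_us(distance_km, round_trips=1):
    """Propagation delay in microseconds for a number of site-to-site round trips."""
    return 2 * distance_km * FIBER_US_PER_KM * round_trips


for km in (100, 200):
    print(f"{km} km: at least {round_trip_us(km) / 1000:.1f} ms "
          f"added to each synchronous write")
```

Beyond these distances the added delay per write grows large enough that asynchronous replication, as used for the tertiary site, becomes the practical choice.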

Figure 19 Trading Resilience.

A robust server load balancing solution is required for order routing, algorithmic trading, risk analysis, and other services to offer continuous access to clients regardless of a server failure. Multiple servers make up a "farm," and these hosts can be added or removed without disruption since they reside behind a virtual IP (VIP) address which is announced in the network.

A global site load balancing solution provides remote traders the resiliency to access trading environments which are closer to their location. This minimizes latency for execution times since requests are always routed to the nearest venue.

Figure 20 Virtualization of Trading Environment.

A trading environment can be virtualized to provide segmentation and resiliency in complex architectures. Figure 20 illustrates a high-level topology depicting multiple market data feeds entering the environment, whereby each vendor is assigned its own Virtual Routing and Forwarding (VRF) instance. The market data is transferred to a high-speed InfiniBand low-latency compute fabric where feed handlers, order routing systems, and algorithmic trading systems reside. All storage is accessed via a SAN and is also virtualized with VSANs, allowing further security and segmentation. The normalized data from the compute fabric is transferred to the campus trading environment where the trading desks reside.

Wide Area Application Services.

This service provides application acceleration and optimization capabilities for traders who are located outside of the core trading floor facility/data center and working from a remote office. To consolidate servers in remote offices, file servers, NAS filers, storage arrays, and tape drives are moved to a corporate data center, which increases security and regulatory compliance and facilitates centralized storage and archival management. As the traditional trading floor is becoming more virtual, wide area application services technology is being utilized to provide a "LAN-like" experience to remote traders when they access resources at the corporate site. Traders often utilize Microsoft Office applications, especially Excel, in addition to Sharepoint and Exchange. Excel is used heavily for modeling and permutations where sometimes only small portions of the file are changed. The CIFS protocol is notoriously "chatty": several messages normally traverse the WAN for a simple file operation, a problem that Wide Area Application Service (WAAS) technology addresses. Bloomberg and Reuters applications are also very popular financial tools which access a centralized SAN or NAS filer to retrieve critical data, which is fused together before being presented on a trader's screen.

Figure 21 Wide Area Optimization.

A pair of Wide Area Application Engines (WAEs) that reside in the remote office and the data center provide local object caching to increase application performance. The remote office WAEs can be a module in the ISR router or a stand-alone appliance. The data center WAE devices are load balanced behind an Application Control Engine module installed in a pair of Catalyst 6500 series switches at the aggregation layer. The WAE appliance farm is represented by a virtual IP address. The local router in each site utilizes Web Cache Communication Protocol version 2 (WCCP v2) to redirect traffic to the WAE that intercepts the traffic and determines if there is a cache hit or miss. The content is served locally from the engine if it resides in cache; otherwise the request is sent across the WAN the initial time to retrieve the object. This methodology optimizes the trader experience by removing application latency and shielding the individual from any congestion in the WAN.

WAAS uses the following technologies to provide application acceleration:

• Data Redundancy Elimination (DRE) is an advanced form of network compression which allows the WAE to maintain a history of previously-seen TCP message traffic for the purposes of reducing redundancy found in network traffic. This combined with the Lempel-Ziv (LZ) compression algorithm reduces the number of redundant packets that traverse the WAN, which improves application transaction performance and conserves bandwidth.

• Transport Flow Optimization (TFO) employs a robust TCP proxy to safely optimize TCP at the WAE device by applying TCP-compliant optimizations to shield the clients and servers from poor TCP behavior because of WAN conditions. By running a TCP proxy between the devices and leveraging an optimized TCP stack between the devices, many of the problems that occur in the WAN are completely blocked from propagating back to trader desktops. The traders experience LAN-like TCP response times and behavior because the WAE is terminating TCP locally. TFO improves reliability and throughput through increases in TCP window scaling and sizing enhancements in addition to superior congestion management.
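The idea behind redundancy elimination can be shown with a deliberately simplified sketch: both ends keep a history of chunks they have already seen, and the sender replaces repeated chunks with short references. This is not Cisco's DRE algorithm; the fixed 64-byte chunking, the SHA-256 digests, and the function names are assumptions chosen for illustration.

```python
import hashlib

CHUNK_BYTES = 64  # assumption: fixed-size chunks keep the sketch simple


def encode(data, history):
    """Replace chunks already present in the sender's history with references."""
    tokens = []
    for i in range(0, len(data), CHUNK_BYTES):
        chunk = data[i:i + CHUNK_BYTES]
        digest = hashlib.sha256(chunk).digest()
        if digest in history:
            tokens.append(("ref", digest))   # seen before: send the digest only
        else:
            history[digest] = chunk
            tokens.append(("raw", chunk))    # first sighting: send the bytes
    return tokens


def decode(tokens, history):
    """Rebuild the original bytes, learning new chunks along the way."""
    parts = []
    for kind, value in tokens:
        if kind == "raw":
            history[hashlib.sha256(value).digest()] = value
            parts.append(value)
        else:
            parts.append(history[value])
    return b"".join(parts)


sender_history, receiver_history = {}, {}
payload = b"A" * 64 + b"B" * 64  # hypothetical file contents sent twice

first = encode(payload, sender_history)
first_ok = decode(first, receiver_history) == payload
second = encode(payload, sender_history)
second_ok = decode(second, receiver_history) == payload

print([kind for kind, _ in first], [kind for kind, _ in second])
```

The second transfer carries only references, which is the effect that lets repeated saves of a mostly unchanged file cross the WAN cheaply.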

Thin Client Service.

This service provides a "thin" advanced trading desktop which delivers significant advantages to demanding trading floor environments requiring continuous growth in compute power. As financial institutions race to provide the best trade executions for their clients, traders are utilizing several simultaneous critical applications that facilitate complex transactions. It is not uncommon to find three or more workstations and monitors at a trader's desk which provide visibility into market liquidity, trading venues, news, analysis of complex portfolio simulations, and other financial tools. In addition, market dynamics continue to evolve with Direct Market Access (DMA), ECNs, alternative trading volumes, and upcoming regulation changes with Regulation National Market System (RegNMS) in the US and Markets in Financial Instruments Directive (MiFID) in Europe. At the same time, business seeks greater control, improved ROI, and additional flexibility, which creates greater demands on trading floor infrastructures.

Traders no longer require multiple workstations at their desk. Thin clients consist of keyboard, mouse, and multi-displays which provide a total trader desktop solution without compromising security. Hewlett Packard, Citrix, Desktone, Wyse, and other vendors provide thin client solutions to capitalize on the virtual desktop paradigm. Thin clients de-couple the user-facing hardware from the processing hardware, thus enabling IT to grow the processing power without changing anything on the end user side. The workstation computing power is stored in the data center on blade workstations, which provide greater scalability, increased data security, improved business continuance across multiple sites, and reduction in OPEX by removing the need to manage individual workstations on the trading floor. One blade workstation can be dedicated to a trader or shared among multiple traders depending on the requirements for computer power.

The "thin client" solution is optimized to work in a campus LAN environment, but can also extend the benefits to traders in remote locations. Latency is always a concern when there is a WAN interconnecting the blade workstation and thin client devices. The network connection needs to be sized accordingly so traffic is not dropped if saturation points exist in the WAN topology. WAN Quality of Service (QoS) should prioritize sensitive traffic. There are some guidelines which should be followed to allow for an optimized user experience. A typical highly-interactive desktop experience requires a client-to-blade round trip latency of <20 ms for a 2Kb packet size. There may be a slight lag in display if network latency is between 20 ms and 40 ms. A typical trader desk with a four multi-display terminal requires 2-3 Mbps of bandwidth with seamless communication with blade workstation(s) in the data center. Streaming video (800x600 at 24fps/full color) requires 9 Mbps of bandwidth.
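These guidelines can be turned into a quick sizing check. The per-desk rate and the latency thresholds come from the figures above; the 25% headroom, the function names, and the 20-desk office are hypothetical assumptions for illustration.

```python
def wan_mbps_needed(desks, mbps_per_desk=3.0, headroom=0.25):
    """WAN capacity for thin client traffic, with headroom for bursts.

    mbps_per_desk uses the upper end of the 2-3 Mbps guideline;
    the 25% headroom is an assumption, not a figure from the text.
    """
    return desks * mbps_per_desk * (1 + headroom)


def desktop_experience(rtt_ms):
    """Classify round trip latency against the guideline thresholds."""
    if rtt_ms < 20:
        return "interactive"
    if rtt_ms <= 40:
        return "slight lag"
    return "outside guideline"


# Hypothetical remote office with 20 trader desks.
needed = wan_mbps_needed(desks=20)
print(f"WAN capacity to provision: {needed} Mbps")
print(desktop_experience(12), desktop_experience(35), desktop_experience(60))
```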

Figure 22 Thin Client Architecture.

Management of a large thin client environment is simplified since a centralized IT staff manages all of the blade workstations dispersed across multiple data centers. A trader is redirected to the most available environment in the enterprise in the event of a particular site failure. High availability is a key concern in critical financial environments and the Blade Workstation design provides rapid provisioning of another blade workstation in the data center. This resiliency provides greater uptime, increases in productivity, and OpEx reduction.

Advanced Encryption Standard.

Advanced Message Queueing Protocol.

Application Oriented Networking.

The Archipelago® Integrated Web book gives investors the unique opportunity to view the entire ArcaEx and ArcaEdge books in addition to books made available by other market participants.

ECN Order Book feed available via NASDAQ.

Chicago Board of Trade.

Class-Based Weighted Fair Queueing.

Continuous Data Replication.

Chicago Mercantile Exchange is engaged in trading of futures contracts and derivatives.

Central Processing Unit.

Distributed Defect Tracking System.

Direct Market Access.

Data Redundancy Elimination.

Dense Wavelength Division Multiplexing.

Electronic Communication Network.

Enterprise Service Bus.

Enterprise Solutions Engineering.

FIX Adapted for Streaming.

Fibre Channel over IP.

Financial Information Exchange.

Financial Services Latency Monitoring Solution.

Financial Service Provider.

Information Lifecycle Management.

Instinet Island Book.

Internetworking Operating System.

Keyboard Video Mouse.

Low Latency Queueing.

Metro Area Network.

Multilayer Director Switch.

Markets in Financial Instruments Directive.

Message Passing Interface is an industry standard specifying a library of functions to enable the passing of messages between nodes within a parallel computing environment.

Network Attached Storage.

Network Accelerated Serverless Backup.

Network Interface Card.

Nasdaq Quotation Dissemination Service.

Order Management System.

Open Systems Interconnection.

Protocol Independent Multicast.

PIM-Source Specific Multicast.

Quality of Service.

Random Access Memory.

Reuters Data Feed.

Reuters Data Feed Direct.

Remote Direct Memory Access.

Regulation National Market System.

Remote Graphics Software.

Reuters Market Data System.

RTP Control Protocol.

Real-time Transport Protocol.

Reuters Wire Format.

Storage Area Network.

Small Computer System Interface.

Sockets Direct Protocol—Given that many modern applications are written using the sockets API, SDP can intercept the sockets at the kernel level and map these socket calls to an InfiniBand transport service that uses RDMA operations to offload data movement from the CPU to the HCA hardware.

Server Fabric Switch.

Secure Financial Transaction Infrastructure network developed to provide firms with excellent communication paths to NYSE Group, AMEX, Chicago Stock Exchange, NASDAQ, and other exchanges. It is often used for order routing.


Saturday, 10 March 2018
